Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Abstract

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe that bridges the gap between image-level pretraining and open-vocabulary object detection. During pretraining, we randomly crop and resize regions of the positional embeddings instead of using the whole-image positional embeddings. This better matches how positional embeddings are used at the region level during detection finetuning. In addition, we replace the softmax cross-entropy loss commonly used in contrastive learning with a focal loss to better learn from informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate the full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 AP_r on LVIS, surpassing the best existing approach by 5.8 points, and is competitive on zero-shot transfer detection. Surprisingly, RO-ViT also improves the image-level representation, achieving the state of the art on 9 of 12 metrics on the COCO and Flickr image-text retrieval benchmarks and outperforming competitive approaches that use larger models.
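The two pretraining changes described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`bilinear_resize`, `cropped_positional_embedding`, `focal_contrastive_loss`), the uniform crop sampling, the sigmoid form of the focal loss, and the `temperature` and `gamma` defaults are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import numpy as np


def bilinear_resize(grid, out_h, out_w):
    """Bilinearly resize a (H, W, D) grid to (out_h, out_w, D)."""
    in_h, in_w, _ = grid.shape
    # Sample locations in input coordinates (corners aligned).
    ys = np.linspace(0.0, in_h - 1, out_h)
    xs = np.linspace(0.0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def cropped_positional_embedding(pos_emb, rng):
    """Randomly crop a region of a (H, W, D) positional-embedding grid
    and resize it back to (H, W, D).

    Assumption: crop size and location are sampled uniformly; the
    abstract does not state the sampling scheme.
    """
    h, w, _ = pos_emb.shape
    crop_h = int(rng.integers(2, h + 1))  # upper bound is exclusive
    crop_w = int(rng.integers(2, w + 1))
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    region = pos_emb[top:top + crop_h, left:left + crop_w]
    return bilinear_resize(region, h, w)


def focal_contrastive_loss(image_emb, text_emb, temperature=0.1, gamma=1.0):
    """Focal loss over all image-text pairs in a batch.

    Assumption: a sigmoid (per-pair binary) focal loss; matched pairs
    sit on the diagonal of the similarity matrix.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (B, B)
    is_match = np.eye(logits.shape[0], dtype=bool)
    # Probability assigned to the true label of each pair:
    # sigmoid(logit) for matches, 1 - sigmoid(logit) = sigmoid(-logit) otherwise.
    signed = np.where(is_match, logits, -logits)
    log_p = -np.logaddexp(0.0, -signed)  # numerically stable log-sigmoid
    p = np.exp(log_p)
    # (1 - p)^gamma down-weights pairs that are already classified well.
    per_pair = -((1.0 - p) ** gamma) * log_p
    return per_pair.sum(axis=1).mean()


rng = np.random.default_rng(0)
pos_emb = rng.standard_normal((14, 14, 8))  # hypothetical 14x14 patch grid
cropped = cropped_positional_embedding(pos_emb, rng)
loss = focal_contrastive_loss(rng.standard_normal((4, 16)),
                              rng.standard_normal((4, 16)))
```

In this sketch, the cropped grid would stand in for the full positional-embedding grid during pretraining, so the backbone sees positional embeddings that look like those of a region rather than a whole image. Setting `gamma=0` reduces the loss to a plain per-pair sigmoid cross-entropy, which makes the effect of the focal weighting easy to compare.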
