Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Abstract

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe that bridges the gap between image-level pretraining and open-vocabulary object detection. During pretraining, we propose to randomly crop and resize regions of the positional embeddings instead of using the whole-image positional embeddings. This better matches how positional embeddings are used at the region level during detection finetuning. In addition, we replace the common softmax cross-entropy loss in contrastive learning with focal loss to better learn from informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 AP_r on LVIS, surpassing the best existing approach by 5.8 points, and is competitive on zero-shot transfer detection. Surprisingly, RO-ViT also improves the image-level representation and achieves the state of the art on 9 of 12 metrics on the COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.
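
As a rough illustration of the two pretraining changes described above, the following pure-Python sketch (1) crops a random region from a grid of positional embeddings and bilinearly resizes it back to the full grid, and (2) applies a focal weighting to the contrastive softmax loss. The function names, the scale range, the square-ish crop shape, the `gamma` and `temperature` defaults, and the choice to apply the focal term to the softmax probability of the matched pair are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def bilinear_resize(grid, out_h, out_w):
    """Bilinearly resize an H x W x D nested-list grid of embedding vectors."""
    in_h, in_w, dim = len(grid), len(grid[0]), len(grid[0][0])
    out = []
    for i in range(out_h):
        # Map the output row to a fractional source row (corners aligned).
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            row.append([
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ])
        out.append(row)
    return out


def cropped_positional_embedding(pos_emb, rng, min_scale=0.1, max_scale=1.0):
    """Crop a random region of the positional-embedding grid and resize it
    back to the full grid size. The scale range is an assumed default."""
    h, w = len(pos_emb), len(pos_emb[0])
    scale = rng.uniform(min_scale, max_scale)  # fraction of the grid area
    crop_h = max(1, round(h * math.sqrt(scale)))
    crop_w = max(1, round(w * math.sqrt(scale)))
    top = rng.randint(0, h - crop_h)    # inclusive bounds
    left = rng.randint(0, w - crop_w)
    crop = [r[left:left + crop_w] for r in pos_emb[top:top + crop_h]]
    return bilinear_resize(crop, h, w)


def focal_contrastive_loss(sim, gamma=1.0, temperature=1.0):
    """Image-to-text focal loss over a B x B similarity matrix whose
    diagonal holds the matched image-text pairs (assumed formulation)."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        log_p = logits[i] - log_z  # log-probability of the true pair
        p = math.exp(log_p)
        total += -((1.0 - p) ** gamma) * log_p
    return total / len(sim)


# Toy usage: a 4 x 4 grid of 2-D "embeddings" and a 2 x 2 similarity matrix.
rng = random.Random(0)
toy_grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
region_pe = cropped_positional_embedding(toy_grid, rng)
toy_loss = focal_contrastive_loss([[2.0, 0.5], [0.1, 1.5]], gamma=2.0)
```

With `gamma=0` the focal term vanishes and the loss reduces to the ordinary softmax cross-entropy; larger values down-weight pairs that are already matched confidently, so the harder examples contribute relatively more.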
