Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Abstract

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe that bridges the gap between image-level pretraining and open-vocabulary object detection. During pretraining, we propose randomly cropping and resizing regions of the positional embeddings instead of using the whole-image positional embeddings. This better matches how positional embeddings are used at the region level during detection finetuning. In addition, we replace the softmax cross-entropy loss commonly used in contrastive learning with a focal loss to better learn from informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 AP_r on LVIS, surpassing the best existing approach by +5.8 points, and shows competitive zero-shot transfer detection performance. Surprisingly, RO-ViT also improves the image-level representation, achieving the state of the art on 9 out of 12 metrics on the COCO and Flickr image-text retrieval benchmarks and outperforming competitive approaches that use larger models.
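The two pretraining changes described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the crop sampling rule, the interpolation method, the temperature, or the focal exponent, so `min_scale`, `temperature`, `gamma`, the bilinear resize, and all function names below are assumptions chosen for illustration only.

```python
import numpy as np


def bilinear_resize(grid, out_h, out_w):
    """Bilinearly resize an (H, W, D) grid to (out_h, out_w, D)."""
    in_h, in_w, _ = grid.shape
    # Sample locations in input coordinates (corner-aligned).
    ys = np.linspace(0.0, in_h - 1.0, out_h)
    xs = np.linspace(0.0, in_w - 1.0, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def crop_positional_embeddings(pos_emb, rng, min_scale=0.1):
    """Randomly crop a region of an (H, W, D) positional-embedding grid
    and resize it back to (H, W, D).

    Assumption: crop height and width are sampled independently as a
    fraction of the grid in [min_scale, 1]; the source does not specify this.
    """
    h, w, _ = pos_emb.shape
    crop_h = max(1, int(round(h * rng.uniform(min_scale, 1.0))))
    crop_w = max(1, int(round(w * rng.uniform(min_scale, 1.0))))
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    region = pos_emb[top:top + crop_h, left:left + crop_w]
    return bilinear_resize(region, h, w)


def focal_contrastive_loss(image_emb, text_emb, temperature=0.07, gamma=1.0):
    """Symmetric image-text contrastive loss with a focal weight.

    Each matched pair's softmax cross-entropy term -log(p) is scaled by
    (1 - p) ** gamma, so confidently matched pairs contribute less.
    gamma = 0 recovers the plain softmax cross-entropy loss.
    Assumption: temperature and gamma values are illustrative defaults.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (B, B) similarities

    def one_direction(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        log_p = np.diag(log_prob)  # log-probability of the matched pair
        p = np.exp(log_p)
        return np.mean(-((1.0 - p) ** gamma) * log_p)

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (one_direction(logits) + one_direction(logits.T))
```

In such a setup, `crop_positional_embeddings` would be applied to the positional-embedding grid before it is added to the patch tokens during pretraining, while the whole-image grid would be used at detection time.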
