Contrastive Feature Masking Open-Vocabulary Vision Transformer

Abstract

We present Contrastive Feature Masking Vision Transformer (CFM-ViT), an image-text pretraining methodology that simultaneously learns image- and region-level representations for open-vocabulary object detection (OVD). Our approach combines the masked autoencoder (MAE) objective with the contrastive learning objective to improve the representation for localization tasks. Unlike standard MAE, we perform reconstruction in the joint image-text embedding space rather than in pixel space, which leads the model to learn region-level semantics better. Moreover, we introduce Positional Embedding Dropout (PED), which randomly drops out the positional embeddings during pretraining, to address the scale variation between image-text pretraining and detection finetuning. PED improves detection performance and enables the use of a frozen ViT backbone as a region classifier, preventing the forgetting of open-vocabulary knowledge during detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 APr, surpassing the previous best approach by 7.6 points, and achieves better zero-shot detection transfer. Finally, CFM-ViT acquires strong image-level representations, outperforming the state of the art on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.
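To make the two ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It illustrates (1) positional embedding dropout and (2) a contrastive loss combined with a feature reconstruction term computed in embedding space. All function names, the choice to drop the positional embeddings as a whole with probability `drop_prob`, the cosine-distance reconstruction loss, the temperature of 0.07, and the `recon_weight` factor are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def add_positional_embeddings(tokens, pos_emb, drop_prob, training, rng=random):
    """Add positional embeddings to patch tokens.

    Assumption: during training, the positional embeddings are dropped as a
    whole with probability drop_prob; at inference they are always added.
    """
    if training and rng.random() < drop_prob:
        return [list(tok) for tok in tokens]  # positional embeddings dropped
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_emb)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row k of each list is a matching image-text pair."""
    n = len(image_embs)
    logits = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]

    def mean_cross_entropy(rows):
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(transposed))


def reconstruction_loss(reconstructed, targets):
    """Mean cosine distance between reconstructed and target features.

    Assumption: the targets live in the joint image-text embedding space,
    not in pixel space.
    """
    distances = [1.0 - cosine(r, t) for r, t in zip(reconstructed, targets)]
    return sum(distances) / len(distances)


def combined_loss(image_embs, text_embs, reconstructed, targets, recon_weight=1.0):
    """Contrastive loss plus a weighted feature reconstruction loss."""
    return (contrastive_loss(image_embs, text_embs)
            + recon_weight * reconstruction_loss(reconstructed, targets))
```

In this sketch, setting `drop_prob` to 0 recovers ordinary positional embeddings, and setting `recon_weight` to 0 recovers purely contrastive pretraining, which makes the contribution of each component easy to isolate.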
