Contrastive Feature Masking Open-Vocabulary Vision Transformer

Abstract

We present Contrastive Feature Masking Vision Transformer (CFM-ViT), an image-text pretraining methodology that simultaneously learns image-level and region-level representations for open-vocabulary object detection (OVD). Our approach combines the masked autoencoder (MAE) objective with the contrastive learning objective to improve the representation for localization tasks. Unlike standard MAE, which reconstructs in pixel space, we perform reconstruction in the joint image-text embedding space, which helps the model learn region-level semantics. Moreover, we introduce Positional Embedding Dropout (PED), which randomly drops out the positional embeddings during pretraining to address the scale variation between image-text pretraining and detection finetuning. PED improves detection performance and enables the use of a frozen ViT backbone as a region classifier, which prevents the forgetting of open-vocabulary knowledge during detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 APr, surpassing the best existing approach by 7.6 points, and achieves better zero-shot detection transfer. Finally, CFM-ViT acquires a strong image-level representation, outperforming the state of the art on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.
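
The Positional Embedding Dropout idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the list-of-lists token format, and the default rate of 0.5 are hypothetical, and the sketch assumes the whole set of positional embeddings is dropped together, a granularity the abstract does not specify.

```python
import random


def add_positional_embeddings(patch_tokens, pos_embeddings,
                              drop_prob=0.5, training=True, rng=random):
    """Add positional embeddings to patch tokens, sometimes skipping them.

    Assumptions (not stated in the abstract): the entire set of positional
    embeddings is dropped at once, and `drop_prob` is an illustrative rate.
    """
    if training and rng.random() < drop_prob:
        # Dropped: tokens go forward with no positional information.
        return [list(token) for token in patch_tokens]
    # Kept: element-wise sum of each token with its positional embedding.
    return [
        [t + p for t, p in zip(token, pos)]
        for token, pos in zip(patch_tokens, pos_embeddings)
    ]


# Two patch tokens with two feature dimensions each.
tokens = [[1.0, 2.0], [3.0, 4.0]]
pos = [[0.5, 0.5], [1.0, 1.0]]
pretrain_out = add_positional_embeddings(tokens, pos)             # random choice
eval_out = add_positional_embeddings(tokens, pos, training=False)  # always added
```

Outside pretraining (`training=False`) the embeddings are always added, so the dropout only affects what the backbone sees while it is being pretrained.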
