HDINO: A Concise and Efficient Open-Vocabulary Detector

Abstract

Despite growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated, fine-grained training datasets and on resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on both components. Specifically, we propose a two-stage training strategy built on the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL), based on initial detection difficulty, is also designed to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or grounding data. It surpasses Grounding DINO-T and T-Rex2, which are trained on 5.4M and 6.5M images, by 0.8 mAP and 2.8 mAP, respectively. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, respectively, highlighting the effectiveness and scalability of our approach. Code and models are available at https://github.com/HaoZ416/HDINO.
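
The comparison with Grounding DINO-T and T-Rex2 is stated as margins rather than absolute scores. As a small worked check, the sketch below derives the baseline mAP values implied by those margins and the fraction of training images HDINO-T uses relative to each baseline. It relies only on the figures quoted in the abstract; the variable and function names are illustrative assumptions, and the derived baseline scores are implied by subtraction, not quoted from the baselines' own reports.

```python
# Figures quoted in the abstract (COCO mAP before fine-tuning, images in millions).
HDINO_T_MAP = 49.2
HDINO_T_IMAGES_M = 2.2

# Baseline name -> (mAP margin by which HDINO-T leads, training images in millions).
baselines = {
    "Grounding DINO-T": (0.8, 5.4),
    "T-Rex2": (2.8, 6.5),
}


def implied_comparison(margin, images_m):
    """Return (implied baseline mAP, HDINO-T image count as a fraction of the baseline's)."""
    implied_map = round(HDINO_T_MAP - margin, 1)
    data_fraction = round(HDINO_T_IMAGES_M / images_m, 2)
    return implied_map, data_fraction


# Hypothetical helper output: one (implied mAP, data fraction) pair per baseline.
results = {name: implied_comparison(*vals) for name, vals in baselines.items()}

for name, (implied_map, fraction) in results.items():
    print(f"{name}: implied mAP {implied_map}, HDINO-T uses {fraction:.0%} of its image count")
```

Under these figures, HDINO-T trains on well under half as many images as either baseline.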
