HDINO: A Concise and Efficient Open-Vocabulary Detector

Abstract

Despite growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated, fine-grained training datasets and on resource-intensive, layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on both components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL), based on initial detection difficulty, is also designed to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or grounding data. It surpasses Grounding DINO-T (trained on 5.4M images) and T-Rex2 (trained on 6.5M images) by 0.8 mAP and 2.8 mAP, respectively. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, highlighting the effectiveness and scalability of our approach. Code and models are available at https://github.com/HaoZ416/HDINO.
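
The comparison figures in the abstract can be unpacked with a little arithmetic. The sketch below is illustrative only: it derives the baseline scores implied by the stated margins and the share of training data HDINO-T uses relative to each baseline. The function and variable names are hypothetical, the derived baseline scores are implied by the abstract rather than quoted from it, and the pairing of image counts with baselines assumes they are listed in the same order as the models.

```python
# Figures quoted in the abstract for HDINO-T (Swin Transformer-T setting).
HDINO_T_MAP = 49.2        # mAP on COCO before fine-tuning
HDINO_T_IMAGES_M = 2.2    # training images, in millions

# Baseline name -> (margin in mAP by which HDINO-T leads, training images in millions).
# Assumption: image counts are paired with models in the order the abstract lists them.
BASELINES = {
    "Grounding DINO-T": (0.8, 5.4),
    "T-Rex2": (2.8, 6.5),
}


def implied_comparison(hdino_map, hdino_images_m, baselines):
    """Return {name: (implied baseline mAP, HDINO-T data as a fraction of the baseline's)}."""
    result = {}
    for name, (margin, images_m) in baselines.items():
        implied_map = round(hdino_map - margin, 1)
        data_fraction = round(hdino_images_m / images_m, 2)
        result[name] = (implied_map, data_fraction)
    return result


summary = implied_comparison(HDINO_T_MAP, HDINO_T_IMAGES_M, BASELINES)
for name, (implied_map, data_fraction) in summary.items():
    print(f"{name}: implied {implied_map} mAP; "
          f"HDINO-T uses {data_fraction:.0%} of its training images")
```

Read this way, the stated margins are obtained with well under half of the training images used by either baseline.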
