HDINO: A Concise and Efficient Open-Vocabulary Detector

Abstract

Despite the growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated, fine-grained training datasets and on resource-intensive, layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on both components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL), designed around initial detection difficulty, is also introduced to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or grounding data. It thereby surpasses Grounding DINO-T and T-Rex2, which are trained on 5.4M and 6.5M images, by 0.8 mAP and 2.8 mAP, respectively. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, highlighting the effectiveness and scalability of our approach. Code and models are available at https://github.com/HaoZ416/HDINO.
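The abstract names a Difficulty Weighted Classification Loss (DWCL) but does not give its formula. The sketch below is therefore not the authors' formulation; it only illustrates the general idea of up-weighting hard positives in a classification loss. The function name `difficulty_weighted_bce`, the `init_scores` input (a hypothetical initial detection score in [0, 1] per sample), the weighting rule `1 + (1 - score)^gamma`, and the `gamma` parameter are all assumptions made for illustration.

```python
import math


def difficulty_weighted_bce(probs, targets, init_scores, gamma=2.0, eps=1e-7):
    """Mean binary cross-entropy with per-sample difficulty weights.

    Assumption (not taken from the paper): a positive sample's difficulty
    is 1 - init_score, where init_score is a hypothetical initial detection
    score in [0, 1]. Negatives always keep weight 1.
    """
    if not (len(probs) == len(targets) == len(init_scores)):
        raise ValueError("inputs must have the same length")
    if not probs:
        raise ValueError("inputs must not be empty")

    total = 0.0
    for p, t, s in zip(probs, targets, init_scores):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        bce = -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        # Hard positives (low initial score) receive a weight of up to 2.
        weight = 1.0 + (1.0 - s) ** gamma if t == 1 else 1.0
        total += weight * bce
    return total / len(probs)


# Same prediction, but the second sample was initially much harder to detect.
easy = difficulty_weighted_bce([0.6], [1], [1.0])
hard = difficulty_weighted_bce([0.6], [1], [0.0])
print(easy < hard)
```

With this rule, a positive that was already detected confidently contributes the plain cross-entropy term, while a positive with a low initial score contributes up to twice as much, so gradient updates concentrate on hard examples.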

