HDINO: A Concise and Efficient Open-Vocabulary Detector
March 3, 2026
Authors: Hao Zhang, Yiqun Wang, Qinran Lin, Runze Fan, Yong Li
cs.AI
Abstract
Despite the growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty-Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without manual data curation or grounding data, surpassing Grounding DINO-T and T-Rex2, which are trained on 5.4M and 6.5M images, by 0.8 mAP and 2.8 mAP, respectively. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, highlighting the effectiveness and scalability of our approach. Code and models are available at https://github.com/HaoZ416/HDINO.
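The abstract does not specify how the DWCL weights are computed, only that they are derived from initial detection difficulty so that hard examples contribute more to the loss. Below is a minimal, hypothetical sketch of one such weighting scheme, assuming difficulty is measured by the gap between the model's initial prediction and the target (the function name, the `gamma` exponent, and the error-based difficulty proxy are all illustrative assumptions, not the paper's definition):

```python
import torch
import torch.nn.functional as F


def difficulty_weighted_bce(logits, targets, gamma=2.0):
    """Illustrative sketch of a difficulty-weighted classification loss.

    Assumption: "difficulty" is proxied by the absolute error between the
    sigmoid prediction and the target, so confidently wrong (hard) examples
    receive larger weights. HDINO's actual DWCL formulation is not given
    in the abstract.
    """
    probs = torch.sigmoid(logits)
    # Per-element prediction error as a proxy for detection difficulty.
    difficulty = (probs - targets).abs()
    weights = difficulty.pow(gamma)  # emphasize hard examples
    # Unreduced BCE, then reweighted by per-example difficulty.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (weights * bce).mean()


# A hard example (confidently wrong) incurs a much larger loss than an
# easy example (confidently right) under this weighting.
hard = difficulty_weighted_bce(torch.tensor([-2.0]), torch.tensor([1.0]))
easy = difficulty_weighted_bce(torch.tensor([2.0]), torch.tensor([1.0]))
```

This shape of down-weighting easy examples is in the same spirit as focal-style losses; the sketch only illustrates the idea of difficulty-based reweighting referenced in the abstract.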