HDINO: A Concise and Efficient Open-Vocabulary Detector

Abstract

Despite growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets and on resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we introduce a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or grounding data. It surpasses Grounding DINO-T and T-Rex2, which are trained on 5.4M and 6.5M images, by 0.8 mAP and 2.8 mAP, respectively. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, respectively, highlighting the effectiveness and scalability of our approach. Code and models are available at https://github.com/HaoZ416/HDINO.
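
The abstract does not give the DWCL formula, so the following is only a minimal illustrative sketch of what a difficulty-weighted classification loss could look like, not the authors' formulation. It assumes, purely for illustration, that "initial detection difficulty" is measured as one minus the IoU between an initial (for example, noised) box and its ground-truth box, and that the weight grows linearly with that difficulty. The function names, the `alpha` parameter, and the weighting rule are all hypothetical.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def difficulty_weighted_bce(prob, target, init_box, gt_box, alpha=1.0):
    """Binary cross-entropy scaled by a hypothetical difficulty weight.

    ASSUMPTION (not from the paper): difficulty = 1 - IoU(init_box, gt_box),
    and the loss weight is 1 + alpha * difficulty.
    """
    eps = 1e-7
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    bce = -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    difficulty = 1.0 - box_iou(init_box, gt_box)
    return (1.0 + alpha * difficulty) * bce


# Same predicted probability, but the second sample starts from a worse box,
# so under the assumed weighting it contributes a larger loss.
gt = (0.0, 0.0, 10.0, 10.0)
easy = difficulty_weighted_bce(0.6, 1.0, (0.0, 0.0, 10.0, 10.0), gt)
hard = difficulty_weighted_bce(0.6, 1.0, (5.0, 5.0, 15.0, 15.0), gt)
print(easy < hard)
```

Under these assumptions, a sample whose initial box overlaps the ground truth poorly is up-weighted relative to one that already overlaps well, which is one simple way to emphasize hard examples. The released code at the URL above is the reference for the actual loss.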
