HDINO: A Concise and Efficient Open-Vocabulary Detector

Abstract

Despite growing interest in open-vocabulary object detection, most existing methods rely heavily on manually curated, fine-grained training datasets and on resource-intensive layer-wise cross-modal feature extraction. We propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on both. Specifically, we introduce a two-stage training strategy built on the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, which facilitates semantic alignment. A Difficulty Weighted Classification Loss (DWCL), based on initial detection difficulty, is also designed to mine hard examples and further improve performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or grounding data. This surpasses Grounding DINO-T and T-Rex2, which are trained on 5.4M and 6.5M images, by 0.8 mAP and 2.8 mAP, respectively. After fine-tuning on COCO, HDINO-T and HDINO-L reach 56.4 mAP and 59.2 mAP, respectively, highlighting the effectiveness and scalability of the approach. Code and models are available at https://github.com/HaoZ416/HDINO.
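The abstract names a Difficulty Weighted Classification Loss but does not give its formula. To illustrate only the general idea of up-weighting hard examples, here is a minimal sketch under stated assumptions: difficulty is taken to be 1 - IoU between a query's initial box and its ground-truth box, and the weight is 1 + difficulty ** gamma. The function name, the difficulty measure, the weighting form, and the gamma parameter are all hypothetical and should not be read as the HDINO implementation.

```python
import math


def difficulty_weighted_bce(probs, targets, ious, gamma=1.0):
    """Illustrative difficulty-weighted binary cross-entropy (NOT the paper's DWCL).

    probs   : predicted class probabilities in (0, 1), one per matched query
    targets : 0/1 labels, same length as probs
    ious    : IoU of each query's initial box with its ground-truth box
    gamma   : hypothetical exponent controlling how sharply hard samples are up-weighted
    """
    if not probs:
        return 0.0
    eps = 1e-7
    total = 0.0
    for p, t, iou in zip(probs, targets, ious):
        # Clamp the probability so the logarithms stay finite.
        p = min(max(p, eps), 1.0 - eps)
        bce = -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        # Assumption: a low initial IoU means a hard sample, so it gets a larger weight.
        weight = 1.0 + (1.0 - iou) ** gamma
        total += weight * bce
    return total / len(probs)


# Same prediction, different initial localization quality.
easy = difficulty_weighted_bce([0.5], [1], [1.0])  # weight 1: plain BCE
hard = difficulty_weighted_bce([0.5], [1], [0.0])  # weight 2: loss is doubled
```

Under this assumed weighting, a sample whose initial box already matches its target contributes the ordinary cross-entropy, while a poorly localized one contributes up to twice as much, which is one simple way to steer training toward hard examples.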
