DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

Abstract

In this paper, we introduce DINO-X, a unified object-centric vision model developed by IDEA Research that achieves the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To make long-tailed object detection easy, DINO-X extends its input options to support text prompts, visual prompts, and customized prompts. Building on these flexible prompt options, we develop a universal object prompt that supports prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt. To enhance the model's core grounding capability and advance its open-vocabulary detection performance, we construct a large-scale dataset of over 100 million high-quality grounding samples, referred to as Grounding-100M. Pre-training on this dataset yields a foundational object-level representation, which enables DINO-X to integrate multiple perception heads and simultaneously support multiple object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, and object-based QA. Experimental results demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks, respectively. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of LVIS-minival and LVIS-val, in both cases improving on the previous SOTA by 5.8 AP. This result underscores its significantly improved capacity for recognizing long-tailed objects.
