Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

Abstract

This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite comprises two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for the faster speed that many applications requiring edge deployment demand. Grounding DINO 1.5 Pro advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, thereby achieving a richer semantic understanding. Grounding DINO 1.5 Edge, although designed for efficiency with reduced feature scales, maintains robust detection capabilities because it is trained on the same comprehensive dataset. Empirical results demonstrate the effectiveness of Grounding DINO 1.5: the Pro model attains 54.3 AP on the COCO detection benchmark and 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection. Furthermore, the Edge model, when optimized with TensorRT, runs at 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it more suitable for edge computing scenarios. Model examples and demos with API will be released at https://github.com/IDEA-Research/Grounding-DINO-1.5-API.
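For a sense of scale, the reported Edge throughput can be converted into a per-frame time budget. The snippet below is a minimal sketch of that arithmetic only; the function name `frame_latency_ms` is hypothetical, and it assumes frames are processed one at a time, which the abstract does not state.

```python
def frame_latency_ms(fps: float) -> float:
    """Convert throughput in frames per second to milliseconds per frame.

    Assumes frames are processed sequentially (an assumption, not a
    detail given in the abstract).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    # 75.2 FPS is the TensorRT-optimized Edge figure quoted in the abstract.
    print(f"{frame_latency_ms(75.2):.1f} ms per frame")
```

Under that assumption, 75.2 FPS corresponds to roughly 13.3 ms per frame.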

