YOLO-World: Real-Time Open-Vocabulary Object Detection

Abstract

The You Only Look Once (YOLO) series of detectors has established itself as an efficient and practical tool. However, these detectors rely on predefined and trained object categories, which limits their applicability in open scenarios. To address this limitation, we introduce YOLO-World, an approach that equips YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and a region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method detects a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP at 52.0 FPS on a V100 GPU, outperforming many state-of-the-art methods in both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.
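
The abstract names a region-text contrastive loss but does not give its form. As a rough illustration only, the sketch below shows one common way such a loss can be written: cosine similarities between region embeddings and text embeddings, scaled by a temperature and passed through a softmax cross-entropy against the index of the matching text. The function names, the `temperature` value, and the toy embeddings are assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def region_text_contrastive_loss(region_embs, text_embs, targets, temperature=0.1):
    """Mean cross-entropy over region-to-text cosine similarities.

    region_embs: list of region embedding vectors (hypothetical inputs)
    text_embs:   list of text embedding vectors, one per vocabulary entry
    targets:     index of the matching text for each region
    temperature: assumed softmax temperature, not a value from the paper
    """
    texts = [l2_normalize(t) for t in text_embs]
    total = 0.0
    for emb, target in zip(region_embs, targets):
        region = l2_normalize(emb)
        # Cosine similarity to every text, scaled by the temperature.
        logits = [
            sum(r * t for r, t in zip(region, text)) / temperature
            for text in texts
        ]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_z - logits[target]
    return total / len(region_embs)


# Toy data: two regions and a two-entry vocabulary (illustrative values only).
regions = [[1.0, 0.0], [0.0, 1.0]]
vocabulary = [[1.0, 0.0], [0.0, 1.0]]
matched = region_text_contrastive_loss(regions, vocabulary, [0, 1])
swapped = region_text_contrastive_loss(regions, vocabulary, [1, 0])
```

In this toy setup, `matched` is smaller than `swapped`, because each region is pulled toward the text it is paired with and pushed away from the other. Because the vocabulary is just a list of text embeddings, new categories can be added at inference time without changing the scoring code, which is the intuition behind open-vocabulary detection.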
