YOLOE: Real-Time Seeing Anything

Abstract

Object detection and segmentation are widely used in computer vision applications. Conventional models such as the YOLO series are efficient and accurate, but they are limited to predefined categories, which hinders their adaptability in open scenarios. Recent open-set methods use text prompts, visual cues, or prompt-free paradigms to overcome this limitation, but they often trade performance against efficiency because of high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time "seeing anything". For text prompts, we propose the Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transfer overhead. For visual prompts, we present the Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to deliver improved visual embeddings and accuracy with minimal complexity. For prompt-free scenarios, we introduce the Lazy Region-Prompt Contrast (LRPC) strategy. It uses a built-in large vocabulary and a specialized embedding to identify all objects, avoiding dependence on a costly language model. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability, together with high inference efficiency and low training cost. Notably, on LVIS, with 3× less training cost and a 1.4× inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves gains of 0.6 AP^b and 0.4 AP^m over the closed-set YOLOv8-L with nearly 4× less training time. Code and models are available at https://github.com/THU-MIG/yoloe.
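To make the "zero inference overhead" claim for RepRTA more concrete, the following is a minimal sketch of the general re-parameterization idea, not the authors' implementation. It assumes a purely linear auxiliary network with a residual connection and a dot-product region-text score; the function names (refine, fold, scores_training, scores_inference), the shapes, and the random data are all hypothetical and chosen only for illustration. The point it shows is that, for a fixed vocabulary, the auxiliary network can be applied once and folded into a constant classifier matrix, so it does not have to run at inference time.

```python
import numpy as np


def refine(text_emb, aux_weight):
    # Hypothetical auxiliary network: one linear map plus a residual connection.
    return text_emb + text_emb @ aux_weight


def scores_training(region_feats, text_emb, aux_weight):
    # Training-time path: the auxiliary network runs on every forward pass.
    return region_feats @ refine(text_emb, aux_weight).T


def fold(text_emb, aux_weight):
    # One-off re-parameterization: bake the refined embeddings into a
    # fixed classifier matrix of shape (num_classes, dim).
    return refine(text_emb, aux_weight)


def scores_inference(region_feats, folded):
    # Inference-time path: a single matrix product, no auxiliary network.
    return region_feats @ folded.T


rng = np.random.default_rng(0)
region_feats = rng.normal(size=(5, 8))      # 5 regions, 8-dim features (assumed)
text_emb = rng.normal(size=(3, 8))          # 3 class prompts (assumed)
aux_weight = 0.1 * rng.normal(size=(8, 8))  # auxiliary weights (assumed)

folded = fold(text_emb, aux_weight)
train_scores = scores_training(region_feats, text_emb, aux_weight)
infer_scores = scores_inference(region_feats, folded)
print(np.allclose(train_scores, infer_scores))
```

In this toy setting the two paths produce the same scores, which is what allows the auxiliary network to be dropped after training. How YOLOE actually structures and folds its auxiliary network is described in the paper and the linked repository.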
