YOLOE: Real-Time Seeing Anything
March 10, 2025
Authors: Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
cs.AI
Abstract
Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigms to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose the Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present the Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For the prompt-free scenario, we introduce the Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language-model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3× less training cost and a 1.4× inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP^b and 0.4 AP^m gains over the closed-set YOLOv8-L with nearly 4× less training time. Code and models are available at https://github.com/THU-MIG/yoloe.
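
The "zero inference and transferring overhead" claim for RepRTA rests on a re-parameterization idea: an auxiliary network that refines text embeddings during training can be applied once offline and then discarded, leaving only fixed classification weights at deployment. The sketch below illustrates that general idea in PyTorch; it is not the authors' implementation, and the module names, shapes, and residual design are assumptions for illustration only.

```python
# Illustrative sketch (not the official RepRTA code): a small auxiliary
# network refines cached text embeddings during training; at deployment the
# refined embeddings are precomputed once and used as fixed class weights,
# so the auxiliary network adds zero inference cost.
import torch
import torch.nn as nn

class AuxTextRefiner(nn.Module):
    """Hypothetical lightweight auxiliary network over cached text embeddings."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Residual refinement keeps the pretrained embedding as a strong prior.
        return text_emb + self.proj(text_emb)

refiner = AuxTextRefiner(dim=512)
cached_text_emb = torch.randn(80, 512)       # one embedding per text prompt (illustrative)
region_feats = torch.randn(4, 100, 512)      # (batch, regions, dim), illustrative

# Training-time: region features are aligned against refined text embeddings.
logits_train = region_feats @ refiner(cached_text_emb).t()

# Deployment-time: bake the refined embeddings in once, then drop the refiner.
with torch.no_grad():
    fixed_class_weights = refiner(cached_text_emb)
logits_deploy = region_feats @ fixed_class_weights.t()
assert torch.allclose(logits_train, logits_deploy, atol=1e-5)
```

The key design point is that the extra capacity exists only on the text side, which is computed once per prompt set rather than once per image, so folding it away changes nothing at inference time.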
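
For the prompt-free mode, the abstract describes LRPC as matching regions against a built-in large vocabulary while avoiding a language model. A minimal sketch of that idea follows, assuming a "lazy" filter that contrasts only object-bearing regions with the vocabulary; all tensors, thresholds, and the function name are made up for illustration and do not reflect the paper's actual procedure.

```python
# Minimal illustrative sketch of the Lazy Region-Prompt Contrast (LRPC) idea:
# only regions flagged as object-bearing by a specialized embedding are
# contrasted with a built-in vocabulary, instead of querying a language model.
import torch
import torch.nn.functional as F

def lazy_region_prompt_contrast(
    region_emb: torch.Tensor,   # (num_regions, dim) region embeddings
    objectness: torch.Tensor,   # (num_regions,) scores from a specialized "object" embedding
    vocab_emb: torch.Tensor,    # (vocab_size, dim) built-in vocabulary embeddings
    obj_thresh: float = 0.3,    # hypothetical threshold
):
    keep = objectness > obj_thresh                 # "lazy": skip background regions entirely
    kept_emb = F.normalize(region_emb[keep], dim=-1)
    vocab = F.normalize(vocab_emb, dim=-1)
    sim = kept_emb @ vocab.t()                     # contrast kept regions with the vocabulary only
    scores, labels = sim.max(dim=-1)               # best-matching vocabulary entry per region
    return keep.nonzero(as_tuple=True)[0], labels, scores

# Usage with random data, for shape checking only.
idx, labels, scores = lazy_region_prompt_contrast(
    torch.randn(300, 512), torch.rand(300), torch.randn(1200, 512)
)
```

The efficiency argument in the abstract corresponds to the filtering step: the vocabulary comparison is paid only for regions that likely contain objects, keeping the prompt-free path cheap relative to language-model-based alternatives.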