

YOLO-World: Real-Time Open-Vocabulary Object Detection

January 30, 2024
作者: Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan
cs.AI

Abstract

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.
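The region-text contrastive loss mentioned in the abstract aligns region (box) embeddings with text embeddings of category names. Below is a minimal, self-contained sketch of such a loss in NumPy; the function name, temperature value, and shapes are illustrative assumptions, not the paper's exact formulation, which also involves the RepVL-PAN architecture and large-scale pre-training.

```python
import numpy as np

def region_text_contrastive_loss(region_emb, text_emb, labels, tau=0.05):
    """Sketch of a region-text contrastive loss (illustrative, not the
    paper's exact formulation).

    region_emb: (num_regions, dim) region/box embeddings
    text_emb:   (num_texts, dim) text embeddings of category names
    labels:     (num_regions,) index of the matching text per region
    tau:        temperature scaling the cosine-similarity logits
    """
    # L2-normalize so dot products become cosine similarities
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = r @ t.T / tau  # (num_regions, num_texts)
    # numerically stable softmax cross-entropy against matching text indices
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))
```

Pulling region embeddings toward their matching text embeddings (and away from the rest) is what lets the detector score boxes against arbitrary category names at inference time, enabling zero-shot, open-vocabulary detection.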