YOLO-World: Real-Time Open-Vocabulary Object Detection
January 30, 2024
Authors: Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan
cs.AI
Abstract
The You Only Look Once (YOLO) series of detectors have established themselves
as efficient and practical tools. However, their reliance on predefined and
trained object categories limits their applicability in open scenarios.
Addressing this limitation, we introduce YOLO-World, an innovative approach
that enhances YOLO with open-vocabulary detection capabilities through
vision-language modeling and pre-training on large-scale datasets.
Specifically, we propose a new Re-parameterizable Vision-Language Path
Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate
the interaction between visual and linguistic information. Our method excels in
detecting a wide range of objects in a zero-shot manner with high efficiency.
On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on
V100, which outperforms many state-of-the-art methods in terms of both accuracy
and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable
performance on several downstream tasks, including object detection and
open-vocabulary instance segmentation.
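To give a concrete sense of the region-text matching that the contrastive loss encourages, here is a minimal, self-contained sketch. It is not the paper's implementation: the embedding dimensions, the temperature value, and the helper names (`region_text_scores`, `contrastive_loss`) are illustrative assumptions. The idea shown is only the core one: score each region embedding against every text (vocabulary) embedding by cosine similarity, normalize with a softmax, and penalize regions whose highest score does not land on their matched text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def region_text_scores(regions, texts, tau=0.1):
    """Softmax-normalized similarity of each region embedding to each
    text embedding; tau is a temperature (value assumed, not from the paper)."""
    scores = []
    for r in regions:
        logits = [cosine(r, t) / tau for t in texts]
        m = max(logits)                      # subtract max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        scores.append([e / z for e in exps])
    return scores

def contrastive_loss(regions, texts, labels, tau=0.1):
    """Mean cross-entropy between region-text score rows and the index of
    each region's matched text prompt."""
    probs = region_text_scores(regions, texts, tau)
    return -sum(math.log(probs[i][labels[i]])
                for i in range(len(regions))) / len(regions)

# Toy usage: two regions, a two-word vocabulary, perfectly aligned embeddings.
regions = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
print(region_text_scores(regions, texts))
print(contrastive_loss(regions, texts, labels=[0, 1]))
```

Because the vocabulary enters only as text embeddings, swapping in new prompts at inference time requires no retraining, which is the mechanism behind the zero-shot detection described in the abstract.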