YOLO-World: Real-Time Open-Vocabulary Object Detection

Abstract

The You Only Look Once (YOLO) series of detectors has established itself as an efficient and practical tool. However, these detectors rely on predefined and trained object categories, which limits their applicability in open scenarios. To address this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and a region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method detects a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP at 52.0 FPS on a V100 GPU, outperforming many state-of-the-art methods in both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.
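To make the idea of matching image regions against a text vocabulary more concrete, the following is a minimal sketch of a region-text contrastive objective, assuming cosine similarity between region and text embeddings followed by a softmax cross-entropy. The function names, the temperature value, and the toy embeddings are hypothetical and chosen only for illustration; the abstract does not give the exact formulation, so this should not be read as the authors' implementation.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def similarity_matrix(region_embs, text_embs):
    """Cosine similarity between every region embedding and every text embedding."""
    regions = [l2_normalize(r) for r in region_embs]
    texts = [l2_normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(r, t)) for t in texts] for r in regions]

def region_text_contrastive_loss(region_embs, text_embs, targets, temperature=0.1):
    """Mean cross-entropy over regions; targets[i] is the index of the matching text.

    The temperature is an assumed hyperparameter, not a value from the paper.
    """
    sims = similarity_matrix(region_embs, text_embs)
    total = 0.0
    for row, target in zip(sims, targets):
        logits = [s / temperature for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum - logits[target]
    return total / len(sims)

# Toy example with made-up 2-D embeddings: two regions, three vocabulary entries.
region_embs = [[1.0, 0.0], [0.0, 2.0]]
text_embs = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

sims = similarity_matrix(region_embs, text_embs)
# Zero-shot style prediction: pick the most similar text entry for each region.
predicted = [max(range(len(row)), key=row.__getitem__) for row in sims]

loss_correct = region_text_contrastive_loss(region_embs, text_embs, [0, 1])
loss_wrong = region_text_contrastive_loss(region_embs, text_embs, [1, 0])
```

In this sketch the vocabulary is just a list of text embeddings, so changing the set of detectable categories amounts to swapping that list, which is the intuition behind open-vocabulary detection. The loss is lower when each region is paired with its most similar text entry than when the pairings are swapped.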
