YOLOE: Real-Time Seeing Anything

Abstract

Object detection and segmentation are widely employed in computer vision applications, yet conventional models such as the YOLO series, while efficient and accurate, are limited to predefined categories, which hinders their adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigms to overcome this, but they often compromise between performance and efficiency because of high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose the Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transfer overhead. For visual prompts, we present the Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to deliver improved visual embeddings and accuracy with minimal complexity. For the prompt-free scenario, we introduce the Lazy Region-Prompt Contrast (LRPC) strategy. It uses a built-in large vocabulary and a specialized embedding to identify all objects, avoiding a costly dependency on language models. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3× less training cost and a 1.4× inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves gains of 0.6 AP^b and 0.4 AP^m over the closed-set YOLOv8-L with nearly 4× less training time. Code and models are available at https://github.com/THU-MIG/yoloe.
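The "zero inference overhead" claim for RepRTA rests on re-parameterization: once the prompts are fixed, whatever the auxiliary network does to the text embeddings can be computed once and folded into the classification head. The following is a minimal sketch of that idea only, not the authors' implementation. The residual linear form of the auxiliary network, the dot-product scoring, and all function names (`refine`, `fold_into_head`, and so on) are assumptions made for illustration; the abstract does not specify them.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def refine(text_emb, aux_weight):
    """Hypothetical auxiliary network: a residual linear layer on text embeddings."""
    delta = matmul(text_emb, aux_weight)
    return [[t + d for t, d in zip(t_row, d_row)]
            for t_row, d_row in zip(text_emb, delta)]


def logits_training_path(region_feats, text_emb, aux_weight):
    """Training-time view: refine the text embeddings, then score every region."""
    return matmul(region_feats, transpose(refine(text_emb, aux_weight)))


def fold_into_head(text_emb, aux_weight):
    """Run the auxiliary network once and keep the result as a fixed head kernel."""
    return transpose(refine(text_emb, aux_weight))  # shape: D x C


def logits_inference_path(region_feats, head_kernel):
    """Inference-time view: one matrix product, no auxiliary network involved."""
    return matmul(region_feats, head_kernel)


rng = random.Random(0)
D, C, N = 4, 3, 5  # embedding dim, number of prompts, number of regions (toy sizes)
text_emb = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(C)]
aux_weight = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
region_feats = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)]

head_kernel = fold_into_head(text_emb, aux_weight)
train_logits = logits_training_path(region_feats, text_emb, aux_weight)
infer_logits = logits_inference_path(region_feats, head_kernel)

# Largest element-wise gap between the two paths; they should agree.
max_diff = max(abs(x - y)
               for r1, r2 in zip(train_logits, infer_logits)
               for x, y in zip(r1, r2))
```

In this toy setup the folded head has the same shape as a closed-set classifier with `C` classes, which is why the inference path costs no more than a conventional detector head under these assumptions.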
