Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

Abstract

Real-time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family has become widely deployed for this reason, yet most YOLO detectors still rely on non-maximum suppression (NMS) at inference, carry heavy detection heads due to Distribution Focal Loss (DFL), require long training schedules, and can leave the smallest objects without positive label assignments. We present Ultralytics YOLO26, a unified real-time vision model family that addresses these limitations through coordinated architecture and training advances. YOLO26 uses a dual-head design for native NMS-free end-to-end inference and removes DFL entirely, yielding a lighter head with an unconstrained regression range. Its training pipeline combines MuSGD, a hybrid Muon-SGD optimizer adapted from large language model training; Progressive Loss, which shifts supervision toward the inference-time head; and STAL, a label assignment strategy that guarantees positive coverage for small objects. Beyond detection, YOLO26 introduces task-specific head and loss designs for instance segmentation, pose estimation, and oriented detection, producing consistent gains across tasks and scales. The family spans five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, classification, and oriented detection in a single pipeline, with an open-vocabulary extension, YOLOE-26, for text-prompted, visual-prompted, and prompt-free inference. Across the five scales, YOLO26 achieves 40.9 to 57.5 mAP on COCO at 1.7 to 11.8 ms T4 TensorRT latency, advancing the accuracy-latency Pareto front over prior real-time detectors, while YOLOE-26x reaches 40.6 AP on LVIS minival under text prompting. Code and models are available at https://github.com/ultralytics/ultralytics.
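
For readers unfamiliar with the post-processing step that the NMS-free design is meant to eliminate, the following is a minimal sketch of conventional greedy non-maximum suppression in plain Python. It is a generic textbook formulation, not code from the YOLO26 repository; the function names `iou` and `greedy_nms`, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are illustrative assumptions.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is kept only if it does not overlap any already-kept box
    by more than iou_threshold (hypothetical default of 0.5).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)  # the lower-scoring overlapping box is dropped
```

The loop is sequential and its cost grows with the number of candidate boxes, and the threshold is a hand-tuned hyperparameter. An end-to-end detector that emits one prediction per object avoids this step at inference.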