Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models
June 2, 2026
Authors: Glenn Jocher, Jing Qiu, Mengyu Liu, Shuai Lyu, Fatih Cagatay Akyon, Muhammet Esat Kalfaoglu
cs.AI
Abstract
Real-time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family is widely deployed for this reason, yet most YOLO detectors still rely on non-maximum suppression (NMS) at inference, carry heavy detection heads due to Distribution Focal Loss (DFL), require long training schedules, and can leave the smallest objects without positive label assignments. We present Ultralytics YOLO26, a unified real-time vision model family that addresses these limitations through coordinated architecture and training advances. YOLO26 uses a dual-head design for native NMS-free end-to-end inference and removes DFL entirely, yielding a lighter head with an unconstrained regression range. Its training pipeline combines MuSGD, a hybrid Muon-SGD optimizer adapted from large language model training; Progressive Loss, which gradually shifts supervision toward the inference-time head; and STAL, a label assignment strategy that guarantees positive coverage for small objects. Beyond detection, YOLO26 introduces task-specific head and loss designs for instance segmentation, pose estimation, and oriented detection, producing consistent gains across tasks and scales. The family spans five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, classification, and oriented detection in a single pipeline, with an open-vocabulary extension, YOLOE-26, for text-prompted, visual-prompted, and prompt-free inference. Across all scales, YOLO26 achieves 40.9-57.5 mAP on COCO at 1.7-11.8 ms T4 TensorRT latency, advancing the accuracy-latency Pareto front over prior real-time detectors, while YOLOE-26x reaches 40.6 AP on LVIS minival under text prompting. Code and models are available at https://github.com/ultralytics/ultralytics.
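
For readers unfamiliar with the post-processing step that the abstract describes as removed, the sketch below shows a conventional greedy NMS in plain Python. It is a generic textbook version, not code from the Ultralytics repository: the names `iou` and `greedy_nms`, the (x1, y1, x2, y2) box layout, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is kept only if it does not overlap any already-kept box
    by more than iou_threshold (hypothetical default of 0.5).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    # The second box overlaps the first (IoU = 81/119) and is suppressed.
    print(greedy_nms(boxes, scores))
```

This loop is quadratic in the number of candidate boxes in the worst case and depends on a hand-tuned threshold, which is the kind of inference-time overhead an end-to-end detector with one prediction per object avoids.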