Ultralytics YOLO26: A Unified Real-Time End-to-End Vision Model

Abstract

Real-time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family is widely deployed for this reason, yet most YOLO detectors still rely on non-maximum suppression (NMS) at inference, carry heavy detection heads because of Distribution Focal Loss (DFL), require long training schedules, and can leave the smallest objects without any positive label assignment. We present Ultralytics YOLO26, a unified real-time vision model family that addresses these limitations through coordinated advances in architecture and training. YOLO26 uses a dual-head design for native NMS-free end-to-end inference and removes DFL entirely, yielding a lighter head with an unconstrained regression range. Its training pipeline combines three components: MuSGD, a hybrid Muon-SGD optimizer adapted from large language model training; Progressive Loss, which shifts supervision toward the inference-time head; and STAL, a label assignment strategy that guarantees positive coverage for small objects. Beyond detection, YOLO26 introduces task-specific head and loss designs for instance segmentation, pose estimation, and oriented detection, producing consistent gains across tasks and scales. The family spans five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, classification, and oriented detection in a single pipeline. An open-vocabulary extension, YOLOE-26, supports text-prompted, visual-prompted, and prompt-free inference. Across all scales, YOLO26 achieves 40.9 to 57.5 mAP on COCO at 1.7 to 11.8 ms T4 TensorRT latency, advancing the accuracy-latency Pareto front over prior real-time detectors, while YOLOE-26x reaches 40.6 AP on LVIS minival under text prompting. Code and models are available at https://github.com/ultralytics/ultralytics.
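
To make the "unconstrained regression range" claim concrete, the sketch below contrasts how a DFL-style head decodes one box side with a plain scalar regression. It is illustrative only and is not taken from the YOLO26 code: the bin count `REG_MAX = 16` follows common DFL-based heads, and `direct_decode`, including its clamp to non-negative values, is a hypothetical stand-in for whatever direct regression YOLO26 actually uses.

```python
import math

REG_MAX = 16  # bins per box side in common DFL-based heads (assumed here)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dfl_decode(logits):
    """DFL-style decoding: expected bin index under the softmax.

    The result always lies in [0, len(logits) - 1], in stride units.
    """
    probs = softmax(logits)
    return sum(i * p for i, p in enumerate(probs))


def direct_decode(raw):
    """Hypothetical direct regression: one scalar per side, no bins.

    The clamp to non-negative distances is an assumption of this sketch.
    """
    return max(raw, 0.0)


# Output channels needed for the four box sides under each scheme.
dfl_channels = 4 * REG_MAX   # one distribution per side
direct_channels = 4          # one scalar per side

# Even with all probability mass on the last bin, DFL saturates near REG_MAX - 1.
saturated = [-100.0] * (REG_MAX - 1) + [100.0]
print("DFL upper bound:", dfl_decode(saturated))
print("Direct regression:", direct_decode(42.5))
print("Channels:", dfl_channels, "vs", direct_channels)
```

Under these assumptions, the DFL-style distance can never exceed `REG_MAX - 1` stride units, whatever the logits are, while the scalar output has no such ceiling. The box branch also shrinks from `4 * REG_MAX` output channels to 4, which is one way to read the "lighter head" claim.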