YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection
December 29, 2025
Authors: Xu Lin, Jinlong Peng, Zhenye Gan, Jiawen Zhu, Jun Liu
cs.AI
Abstract
Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs, misallocating representational capacity and computational resources, for example over-allocating to trivial scenes while under-serving complex ones. This mismatch results in both computational redundancy and suboptimal detection performance. To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. This is achieved through an Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity-enhancing objective, encouraging complementary expertise among experts. Additionally, the routing network adaptively learns to activate only the most relevant experts, thereby improving detection performance while minimizing computational overhead during inference. Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4% AP with 1.62 ms latency, outperforming YOLOv13-N by +0.8% mAP while running 17.8% faster. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Code will be made publicly available.
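To make the instance-conditional computation described above more concrete, the minimal PyTorch sketch below shows a generic sparse Mixture-of-Experts block with a lightweight router, top-k expert activation, and a load-balancing auxiliary term standing in for the diversity-enhancing objective. All names and design choices here (ESMoEBlock, the convolutional experts, num_experts, top_k, the auxiliary loss) are illustrative assumptions; the paper's actual ES-MoE implementation has not been released.

```python
# Hypothetical sketch of an instance-conditional sparse MoE block, loosely
# following the abstract: a lightweight router selects the top-k experts per
# image, and an auxiliary term encourages balanced (diverse) expert usage.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ESMoEBlock(nn.Module):
    def __init__(self, channels: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # Each expert is a small conv branch; the real expert design is unknown.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU())
            for _ in range(num_experts)
        )
        # Lightweight router: global average pooling followed by a linear gate.
        self.router = nn.Linear(channels, num_experts)

    def forward(self, x: torch.Tensor):
        # Instance-level routing signal from globally pooled features.
        logits = self.router(x.mean(dim=(2, 3)))            # (B, E)
        probs = F.softmax(logits, dim=-1)                    # (B, E)
        topk_vals, topk_idx = probs.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(probs).scatter_(1, topk_idx, topk_vals)
        gates = gates / gates.sum(dim=-1, keepdim=True)      # renormalize top-k

        # Sparse combination: only the selected experts are executed per image.
        out = torch.zeros_like(x)
        for e in range(self.num_experts):
            mask = gates[:, e] > 0
            if mask.any():
                out[mask] += gates[mask, e].view(-1, 1, 1, 1) * self.experts[e](x[mask])

        # Simple load-balancing auxiliary loss (a stand-in for the paper's
        # diversity-enhancing objective): penalizes collapsing onto one expert.
        usage = gates.mean(dim=0)                            # (E,)
        aux_loss = self.num_experts * (usage * probs.mean(dim=0)).sum()
        return out, aux_loss
```

In a block of this kind, only the top-k expert branches run for each image at inference time, which is how per-input compute can stay close to a dense baseline on simple scenes while extra capacity is reserved for complex ones; whether YOLO-Master follows this exact scheme is an assumption here.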