YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection

Abstract

Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs. This misallocates representational capacity and computational resources: trivial scenes are over-served while complex scenes are under-served. The mismatch results in both computational redundancy and suboptimal detection performance. To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. This is achieved through an Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity-enhancing objective, encouraging complementary expertise among experts. Additionally, the routing network learns to activate only the most relevant experts for each input, thereby improving detection performance while minimizing computational overhead during inference. Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4% AP with 1.62ms latency, outperforming YOLOv13-N by 0.8% mAP while running inference 17.8% faster. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Code will be available.
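The abstract describes a routing network that activates only the most relevant experts for each input but gives no implementation details. As a rough illustration of that general idea, the sketch below uses generic top-k softmax gating in plain Python. It is not the ES-MoE block itself: the function names, the toy scalar "experts", the fixed router logits, and the choice of k=2 are all assumptions made for illustration, and the diversity-enhancing training objective is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def sparse_moe(x, experts, router_logits, k=2):
    """Weighted sum over the selected experts only; the others are never called."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Hypothetical toy experts: scalar functions standing in for network branches.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x, lambda x: -x]

# Assumed router scores; in a real model a learned router would produce these
# from the input features.
router_logits = [2.0, 0.5, 2.0, -1.0]

weights = top_k_route(router_logits, k=2)
output = sparse_moe(3.0, experts, router_logits, k=2)
print(weights, output)
```

With these assumed logits, experts 0 and 2 are selected with equal weight, so only two of the four expert functions run. Skipping the unselected experts is where the compute saving of sparse activation comes from.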
