Real-Time Object Detection Meets DINOv3
September 25, 2025
Authors: Shihua Huang, Yongjie Hou, Longfei Liu, Xuanlong Yu, Xi Shen
cs.AI
Abstract
Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM
has become the mainstream training framework for real-time DETRs, significantly
outperforming the YOLO series. In this work, we extend it with DINOv3 features,
resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering
GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt
DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter
(STA), which efficiently converts DINOv3's single-scale output into multi-scale
features and complements strong semantics with fine-grained details to enhance
detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we
employ HGNetv2 with depth and width pruning to meet strict resource budgets.
Together with a simplified decoder and an upgraded Dense O2O, this unified
design enables DEIMv2 to achieve a superior performance-cost trade-off across
diverse scenarios, establishing new state-of-the-art results. Notably, our
largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters,
surpassing prior X-scale models that require over 60 million parameters for
just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10 million model
(9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even
the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers
38.5 AP, matching YOLOv10-Nano (2.3 million), which uses about 50 percent
more parameters. Our code and pre-trained models are available at
https://github.com/Intellindust-AI-Lab/DEIMv2.
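The core idea behind the Spatial Tuning Adapter, expanding a single-scale ViT feature map into a multi-scale pyramid for detection, can be illustrated with a minimal NumPy sketch. This is not the paper's actual STA (which is learned); the function name and the resize operations below are illustrative stand-ins for how a stride-16 DINOv3 output might be turned into stride-8/16/32 levels:

```python
import numpy as np

def to_multi_scale(feat):
    """Expand one feature map (C, H, W) into a three-level pyramid.

    Produces 2x, 1x, and 0.5x resolution copies, mimicking strides
    8/16/32 when the input is a stride-16 ViT feature map. A real
    adapter would use learned up/down-sampling layers instead.
    """
    c, h, w = feat.shape
    # Finer level: 2x nearest-neighbor upsampling (stride 16 -> 8)
    up = feat.repeat(2, axis=1).repeat(2, axis=2)
    # Coarser level: 2x2 average pooling (stride 16 -> 32)
    down = feat.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return [up, feat, down]

# Example: a stride-16 map for a 640x640 input is 40x40 spatially.
feat = np.random.rand(256, 40, 40).astype(np.float32)
pyramid = to_multi_scale(feat)
print([p.shape for p in pyramid])
# [(256, 80, 80), (256, 40, 40), (256, 20, 20)]
```

The sketch only covers the spatial-resizing aspect; in DEIMv2 the adapter also fuses fine-grained details with DINOv3's semantics before the decoder consumes the pyramid.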