Real-Time Object Detection Meets DINOv3

September 25, 2025
Authors: Shihua Huang, Yongjie Hou, Longfei Liu, Xuanlong Yu, Xi Shen
cs.AI

Abstract

Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3's single-scale output into multi-scale features and complements its strong semantics with fine-grained details to enhance detection. For the ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10-million-parameter model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano, which needs about 50 percent more parameters (2.3 million). Our code and pre-trained models are available at https://github.com/Intellindust-AI-Lab/DEIMv2.
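The abstract describes the Spatial Tuning Adapter only at a high level: it turns the single-scale feature map of a ViT backbone such as DINOv3 into a multi-scale pyramid for the detector. Below is a minimal, hypothetical PyTorch sketch of that idea. The class name `SpatialTuningAdapter`, the channel widths, and the specific up/down-sampling layers are illustrative assumptions, not the released DEIMv2 implementation; see the repository linked above for the authors' code.

```python
import torch
import torch.nn as nn


class SpatialTuningAdapter(nn.Module):
    """Illustrative STA-like adapter (assumption, not the official DEIMv2 code).

    Takes a single-scale backbone feature map (e.g., a DINOv3 output at
    stride 16) and produces a three-level pyramid at strides 8/16/32,
    adding finer spatial detail on top of the backbone's semantics.
    """

    def __init__(self, in_dim: int = 768, out_dim: int = 256):
        super().__init__()
        # Project backbone channels down to the detector's width.
        self.proj = nn.Conv2d(in_dim, out_dim, kernel_size=1)
        # Upsample path for fine-grained detail (stride 16 -> 8).
        self.up = nn.ConvTranspose2d(out_dim, out_dim, kernel_size=2, stride=2)
        # Downsample path for coarser context (stride 16 -> 32).
        self.down = nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=2, padding=1)
        # Light per-level refinement.
        self.refine = nn.ModuleList(
            nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=1) for _ in range(3)
        )

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        # x: (B, C, H/16, W/16), the backbone's single-scale output.
        mid = self.proj(x)
        feats = [self.up(mid), mid, self.down(mid)]
        return [conv(f) for conv, f in zip(self.refine, feats)]


if __name__ == "__main__":
    sta = SpatialTuningAdapter()
    x = torch.randn(1, 768, 40, 40)  # e.g., a 640x640 image at stride 16
    for f in sta(x):
        print(f.shape)  # (1,256,80,80), (1,256,40,40), (1,256,20,20)
```

The design choice sketched here, deriving all pyramid levels from one strong single-scale feature rather than tapping intermediate backbone stages, is what lets a plain ViT backbone like DINOv3 feed a detector that expects multi-scale inputs.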