Real-Time Object Detection Meets DINOv3
September 25, 2025
Authors: Shihua Huang, Yongjie Hou, Longfei Liu, Xuanlong Yu, Xi Shen
cs.AI
Abstract
Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM
has become the mainstream training framework for real-time DETRs, significantly
outperforming the YOLO series. In this work, we extend it with DINOv3 features,
resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering
GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt
DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter
(STA), which efficiently converts DINOv3's single-scale output into multi-scale
features and complements strong semantics with fine-grained details to enhance
detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we
employ HGNetv2 with depth and width pruning to meet strict resource budgets.
Together with a simplified decoder and an upgraded Dense O2O, this unified
design enables DEIMv2 to achieve a superior performance-cost trade-off across
diverse scenarios, establishing new state-of-the-art results. Notably, our
largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters,
surpassing prior X-scale models that require over 60 million parameters for
just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10 million model
(9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even
the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers
38.5 AP, matching YOLOv10-Nano (2.3 million), which uses about 50 percent
more parameters. Our code and pre-trained models are available at
https://github.com/Intellindust-AI-Lab/DEIMv2.
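The core idea behind the Spatial Tuning Adapter, expanding a single-scale ViT feature map into a multi-scale pyramid for detection, can be illustrated with a minimal NumPy sketch. This is not the paper's actual STA (which is learned); the function name and the resize operations below are illustrative stand-ins for how a stride-16 DINOv3 output might be turned into stride-8/16/32 levels:

```python
import numpy as np

def to_multi_scale(feat):
    """Expand one feature map (C, H, W) into a three-level pyramid.

    Produces 2x, 1x, and 0.5x resolution copies, mimicking strides
    8/16/32 when the input is a stride-16 ViT feature map. A real
    adapter would use learned up/down-sampling layers instead.
    """
    c, h, w = feat.shape
    # Finer level: 2x nearest-neighbor upsampling (stride 16 -> 8)
    up = feat.repeat(2, axis=1).repeat(2, axis=2)
    # Coarser level: 2x2 average pooling (stride 16 -> 32)
    down = feat.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return [up, feat, down]

# Example: a stride-16 map for a 640x640 input is 40x40 spatially.
feat = np.random.rand(256, 40, 40).astype(np.float32)
pyramid = to_multi_scale(feat)
print([p.shape for p in pyramid])
# [(256, 80, 80), (256, 40, 40), (256, 20, 20)]
```

The sketch only covers the spatial-resizing aspect; in DEIMv2 the adapter also fuses fine-grained details with DINOv3's semantics before the decoder consumes the pyramid.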