
Real-Time Object Detection Meets DINOv3

September 25, 2025
作者: Shihua Huang, Yongjie Hou, Longfei Liu, Xuanlong Yu, Xi Shen
cs.AI

Abstract

Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3's single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10-million-parameter model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3 million) with around 50 percent fewer parameters. Our code and pre-trained models are available at https://github.com/Intellindust-AI-Lab/DEIMv2.
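The abstract does not detail the Spatial Tuning Adapter's internals, but its stated job is to turn a single-scale backbone output (a ViT like DINOv3 emits one feature map at a fixed stride) into the multi-scale pyramid a DETR-style detector expects. As a rough illustration of that single-scale-to-multi-scale step only — not the actual STA, whose design and names below (`to_multiscale`, the stride choices) are assumptions for this sketch — one can build coarser and finer maps from a stride-16 map by pooling and upsampling:

```python
import numpy as np

def to_multiscale(feat, out_strides=(8, 16, 32), in_stride=16):
    """Hypothetical sketch: expand one backbone feature map (C, H, W) at
    `in_stride` into a small pyramid, via nearest-neighbor upsampling for
    finer strides and average pooling for coarser ones."""
    outs = {}
    for s in out_strides:
        if s == in_stride:
            outs[s] = feat
        elif s < in_stride:
            # Finer scale: repeat each spatial cell r times along H and W.
            r = in_stride // s
            outs[s] = feat.repeat(r, axis=1).repeat(r, axis=2)
        else:
            # Coarser scale: average-pool non-overlapping r x r windows.
            r = s // in_stride
            C, H, W = feat.shape
            outs[s] = feat.reshape(C, H // r, r, W // r, r).mean(axis=(2, 4))
    return outs

# Example: a stride-16 map for a 640x640 input is 40x40; the pyramid
# covers strides 8/16/32 (80x80, 40x40, 20x20).
feat = np.random.rand(256, 40, 40).astype(np.float32)
pyramid = to_multiscale(feat)
for stride, f in pyramid.items():
    print(stride, f.shape)
```

A learned adapter would replace the parameter-free resampling here with convolutions and fuse in fine-grained detail, which is what the STA is described as adding on top of DINOv3's strong semantics.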