
Real-Time Object Detection Meets DINOv3

September 25, 2025
作者: Shihua Huang, Yongjie Hou, Longfei Liu, Xuanlong Yu, Xi Shen
cs.AI

Abstract

Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3's single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10-million-parameter model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3 million) with around 50 percent fewer parameters. Our code and pre-trained models are available at https://github.com/Intellindust-AI-Lab/DEIMv2.
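The abstract does not detail the Spatial Tuning Adapter's internals, but its stated job is to turn a single-scale backbone output (a ViT like DINOv3 emits one feature map at a fixed stride) into the multi-scale pyramid a DETR-style detector expects. As a rough illustration of that single-scale-to-multi-scale step only — not the actual STA, whose design and names below (`to_multiscale`, the stride choices) are assumptions for this sketch — one can build coarser and finer maps from a stride-16 map by pooling and upsampling:

```python
import numpy as np

def to_multiscale(feat, out_strides=(8, 16, 32), in_stride=16):
    """Hypothetical sketch: expand one backbone feature map (C, H, W) at
    `in_stride` into a small pyramid, via nearest-neighbor upsampling for
    finer strides and average pooling for coarser ones."""
    outs = {}
    for s in out_strides:
        if s == in_stride:
            outs[s] = feat
        elif s < in_stride:
            # Finer scale: repeat each spatial cell r times along H and W.
            r = in_stride // s
            outs[s] = feat.repeat(r, axis=1).repeat(r, axis=2)
        else:
            # Coarser scale: average-pool non-overlapping r x r windows.
            r = s // in_stride
            C, H, W = feat.shape
            outs[s] = feat.reshape(C, H // r, r, W // r, r).mean(axis=(2, 4))
    return outs

# Example: a stride-16 map for a 640x640 input is 40x40; the pyramid
# covers strides 8/16/32 (80x80, 40x40, 20x20).
feat = np.random.rand(256, 40, 40).astype(np.float32)
pyramid = to_multiscale(feat)
for stride, f in pyramid.items():
    print(stride, f.shape)
```

A learned adapter would replace the parameter-free resampling here with convolutions and fuse in fine-grained detail, which is what the STA is described as adding on top of DINOv3's strong semantics.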