Real-Time Object Detection Meets DINOv3

Abstract

Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3's single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10 million model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3 million) with around 50 percent fewer parameters. Our code and pre-trained models are available at https://github.com/Intellindust-AI-Lab/DEIMv2
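The parameter comparisons above can be checked with a little arithmetic. The sketch below recomputes relative parameter savings using only the rounded figures quoted in the abstract; the helper name `relative_reduction` and the use of 60.0 as a stand-in for "over 60 million" are illustrative assumptions, not part of the DEIMv2 codebase.

```python
# Parameter counts in millions, as quoted (and rounded) in the abstract.
PICO_PARAMS_M = 1.5          # DEIMv2-Pico
YOLOV10_NANO_PARAMS_M = 2.3  # YOLOv10-Nano
X_PARAMS_M = 50.3            # DEIMv2-X
PRIOR_X_PARAMS_M = 60.0      # assumption: lower bound for "over 60 million"


def relative_reduction(ours_m: float, baseline_m: float) -> float:
    """Return the fraction of parameters saved relative to the baseline."""
    return (baseline_m - ours_m) / baseline_m


pico_vs_yolov10n = relative_reduction(PICO_PARAMS_M, YOLOV10_NANO_PARAMS_M)
x_vs_prior = relative_reduction(X_PARAMS_M, PRIOR_X_PARAMS_M)
x_ap_gain = 57.8 - 56.5  # AP difference between DEIMv2-X and prior X-scale models

print(f"Pico vs YOLOv10-Nano: {pico_vs_yolov10n:.1%} fewer parameters")
print(f"X vs prior X-scale:   at least {x_vs_prior:.1%} fewer parameters")
print(f"X AP gain:            {x_ap_gain:.1f} AP")
```

With these rounded counts, the Pico saving comes out closer to one third than one half, so the "around 50 percent" figure presumably rests on unrounded counts or a baseline the abstract does not spell out.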
