DETR Doesn't Need Multi-Scale or Locality Design

Abstract

This paper presents an improved DETR detector that maintains a "plain" nature: it uses a single-scale feature map and global cross-attention without specific locality constraints, in contrast to previous leading DETR-based detectors that reintroduce the architectural inductive biases of multi-scale features and locality into the decoder. We show that two simple techniques are surprisingly effective within a plain design at compensating for the lack of multi-scale feature maps and locality constraints. The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which guides each query to attend to its corresponding object region while retaining encoding flexibility. The second is masked image modeling (MIM)-based backbone pre-training, which helps learn representations with fine-grained localization ability and proves crucial for removing the dependence on multi-scale feature maps. By incorporating these techniques together with recent advances in training and problem formulation, the improved "plain" DETR shows exceptional improvements over the original DETR detector. With pre-training on the Objects365 dataset, it achieves 63.9 mAP with a Swin-L backbone, which is highly competitive with state-of-the-art detectors that all rely heavily on multi-scale feature maps and region-based feature extraction. Code is available at https://github.com/impiga/Plain-DETR.
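To make the idea of a box-dependent bias added to global cross-attention concrete, here is a minimal sketch in plain Python. It is not the released implementation: the abstract does not give the BoxRPB formula, so the offset features, the function names, and the hand-written `inside_box_bias` (a hypothetical stand-in for whatever learned mapping produces the bias) are all assumptions made for illustration. The sketch only shows the general mechanism: every query still attends to every pixel of a single-scale map, and an additive term in the attention logits softly shifts weight toward the query's box.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def box_to_pixel_bias(box, width, height, bias_fn):
    """Return one bias value per pixel (row-major) for a single query box.

    Each pixel center is described by its offsets to the box corners
    (an assumed feature choice); bias_fn maps those offsets to a scalar.
    """
    x1, y1, x2, y2 = box
    bias = []
    for row in range(height):
        for col in range(width):
            cx, cy = col + 0.5, row + 0.5
            offsets = (cx - x1, cy - y1, cx - x2, cy - y2)
            bias.append(bias_fn(offsets))
    return bias


def inside_box_bias(offsets, outside_penalty=-4.0):
    """Hypothetical stand-in for a learned bias: 0 inside the box, a penalty outside."""
    dx1, dy1, dx2, dy2 = offsets
    inside = dx1 >= 0 and dy1 >= 0 and dx2 <= 0 and dy2 <= 0
    return 0.0 if inside else outside_penalty


def cross_attention_weights(query, keys, bias):
    """Global attention over all pixels: softmax(q . k / sqrt(d) + bias)."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        scale * sum(q * k for q, k in zip(query, key)) + b
        for key, b in zip(keys, bias)
    ]
    return softmax(logits)


def mass_inside(weights, bias):
    """Total attention weight on pixels whose bias is 0 (inside the box here)."""
    return sum(w for w, b in zip(weights, bias) if b == 0.0)


# Toy single-scale feature map: 8x8 pixels with 4-dimensional random keys.
random.seed(0)
W, H, D = 8, 8, 4
keys = [[random.gauss(0, 1) for _ in range(D)] for _ in range(W * H)]
query = [random.gauss(0, 1) for _ in range(D)]
box = (2.0, 2.0, 6.0, 6.0)  # (x1, y1, x2, y2) in pixel units

bias = box_to_pixel_bias(box, W, H, inside_box_bias)
plain = cross_attention_weights(query, keys, [0.0] * (W * H))
biased = cross_attention_weights(query, keys, bias)

print("attention mass inside box, no bias:  ", round(mass_inside(plain, bias), 3))
print("attention mass inside box, with bias:", round(mass_inside(biased, bias), 3))
```

Because the bias is additive rather than a hard mask, pixels outside the box keep a nonzero weight, which is one way to read the abstract's point that the attention stays global while still being guided toward the object region.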
