DETR Doesn't Need Multi-Scale or Locality Design

Abstract

This paper presents an improved DETR detector that keeps a "plain" design: it uses a single-scale feature map and global cross-attention without specific locality constraints, in contrast to previous leading DETR-based detectors, which reintroduce the architectural inductive biases of multi-scale features and locality into the decoder. We show that two simple techniques are surprisingly effective within a plain design at compensating for the lack of multi-scale feature maps and locality constraints. The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which guides each query to attend to the corresponding object region while still providing encoding flexibility. The second is masked image modeling (MIM)-based backbone pre-training, which helps learn representations with fine-grained localization ability and proves crucial for removing the dependency on multi-scale feature maps. By incorporating these techniques together with recent advances in training and problem formulation, the improved "plain" DETR shows exceptional improvements over the original DETR detector. With pre-training on the Objects365 dataset, it achieves 63.9 mAP using a Swin-L backbone, which is highly competitive with state-of-the-art detectors, all of which rely heavily on multi-scale feature maps and region-based feature extraction. Code is available at https://github.com/impiga/Plain-DETR.
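To illustrate the general idea of adding a box-dependent bias to cross-attention logits, here is a minimal pure-Python sketch. It is not the authors' implementation: the function `toy_box_bias`, its `scale` parameter, and the hand-written penalty are assumptions made for illustration. In the paper the bias is learned, and its exact parameterization is not specified in this abstract.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def toy_box_bias(box, px, py, scale=4.0):
    # Hypothetical stand-in for a learned box-to-pixel bias (assumption).
    # box = (x1, y1, x2, y2); (px, py) is a pixel center.
    x1, y1, x2, y2 = box
    # Offsets of the pixel relative to the two box corners.
    dx1, dy1 = px - x1, py - y1
    dx2, dy2 = px - x2, py - y2
    # Distance by which the pixel lies outside the box on each axis (0 inside).
    out_x = max(-dx1, 0.0, dx2)
    out_y = max(-dy1, 0.0, dy2)
    return -scale * (out_x + out_y)

def cross_attention_weights(query, keys, coords, box, use_bias=True):
    # Global attention over all pixels: scaled dot product plus optional bias.
    d = len(query)
    logits = []
    for key, (px, py) in zip(keys, coords):
        content = sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        bias = toy_box_bias(box, px, py) if use_bias else 0.0
        logits.append(content + bias)
    return softmax(logits)

# A 4x4 single-scale feature map with pixel centers at (i + 0.5, j + 0.5).
coords = [(i + 0.5, j + 0.5) for j in range(4) for i in range(4)]
# Identical keys, so the content term alone gives uniform attention.
keys = [[0.0, 0.0] for _ in coords]
query = [1.0, -1.0]
box = (1.0, 1.0, 3.0, 3.0)  # the query's current box prediction (toy values)

def mass_inside(weights):
    # Total attention weight on pixels whose centers fall inside the box.
    return sum(w for w, (px, py) in zip(weights, coords)
               if box[0] <= px <= box[2] and box[1] <= py <= box[3])

weights = cross_attention_weights(query, keys, coords, box)
plain_weights = cross_attention_weights(query, keys, coords, box, use_bias=False)
inside_mass = mass_inside(weights)
plain_inside_mass = mass_inside(plain_weights)
print(f"inside-box attention with bias: {inside_mass:.3f}")
print(f"inside-box attention without bias: {plain_inside_mass:.3f}")
```

In this toy setup every pixel still takes part in the softmax, so attention remains global. The bias only shifts weight toward the box region rather than restricting attention to it.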
