DETR Doesn't Need Multi-Scale or Locality Design

Abstract

This paper presents an improved DETR detector that stays "plain": it uses a single-scale feature map and global cross-attention with no explicit locality constraints, in contrast to previous leading DETR-based detectors, which reintroduce the architectural inductive biases of multi-scale features and locality into the decoder. We show that two simple techniques are surprisingly effective within a plain design at compensating for the lack of multi-scale feature maps and locality constraints. The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which guides each query to attend to its corresponding object region while still providing encoding flexibility. The second is masked image modeling (MIM)-based backbone pre-training, which helps learn representations with fine-grained localization ability and proves crucial for removing the dependence on multi-scale feature maps. By incorporating these techniques together with recent advances in training and problem formulation, the improved "plain" DETR shows significant improvements over the original DETR detector. With pre-training on the Objects365 dataset, it achieves 63.9 mAP using a Swin-L backbone, which is highly competitive with state-of-the-art detectors, all of which rely heavily on multi-scale feature maps and region-based feature extraction. Code is available at https://github.com/impiga/Plain-DETR .
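
To make the idea of a box-to-pixel bias in cross-attention concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the form of the bias, so the hand-written `toy_bias` function (hypothetical) stands in for whatever learned mapping the method uses, and all function names, shapes, and the `scale` parameter are invented for this example. The bias is simply added to the attention logits before the softmax, so each query still attends globally over a single-scale feature map while being nudged toward its box.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def box_to_pixel_offsets(boxes, height, width):
    """Offsets from each box's corners to every pixel center.

    boxes: (N, 4) array of (x1, y1, x2, y2) in feature-map pixel units.
    Returns (N, H*W, 4) holding (x - x1, y - y1, x - x2, y - y2).
    """
    ys, xs = np.meshgrid(np.arange(height) + 0.5,
                         np.arange(width) + 0.5, indexing="ij")
    pix = np.stack([xs.ravel(), ys.ravel()], axis=-1)    # (H*W, 2)
    to_top_left = pix[None, :, :] - boxes[:, None, 0:2]  # (N, H*W, 2)
    to_bottom_right = pix[None, :, :] - boxes[:, None, 2:4]
    return np.concatenate([to_top_left, to_bottom_right], axis=-1)

def toy_bias(offsets, scale=1.0):
    # ASSUMPTION: hypothetical stand-in for a learned bias. It is 0 for
    # pixels inside the box and grows more negative with distance outside.
    dx1, dy1, dx2, dy2 = np.moveaxis(offsets, -1, 0)
    outside_x = np.maximum(-dx1, 0) + np.maximum(dx2, 0)
    outside_y = np.maximum(-dy1, 0) + np.maximum(dy2, 0)
    return -scale * (outside_x + outside_y)              # (N, H*W)

def cross_attention_with_bias(q, k, v, bias):
    # Global attention over all pixels; the bias is added to the logits.
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

# Two queries, each with a predicted box, attending over an 8x8 feature map.
rng = np.random.default_rng(0)
H, W, d = 8, 8, 16
boxes = np.array([[1.0, 1.0, 4.0, 4.0],
                  [4.0, 2.0, 8.0, 7.0]])
q = rng.normal(size=(2, d))
k = rng.normal(size=(H * W, d))
v = rng.normal(size=(H * W, d))

bias = toy_bias(box_to_pixel_offsets(boxes, H, W), scale=2.0)
out, weights = cross_attention_with_bias(q, k, v, bias)
_, plain_weights = cross_attention_with_bias(q, k, v, np.zeros_like(bias))

inside = bias == 0  # pixels whose centers fall inside each query's box
print("attention mass inside box, with bias:", (weights * inside).sum(axis=1))
print("attention mass inside box, no bias:  ", (plain_weights * inside).sum(axis=1))
```

Because the toy bias is zero inside the box and negative outside, the attention mass inside each box can only increase relative to unbiased attention. A learned bias is more flexible than this fixed penalty, which is presumably what the "encoding flexibility" remark in the abstract refers to.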
