DETR Doesn't Need Multi-Scale or Locality Design

August 3, 2023
Authors: Yutong Lin, Yuhui Yuan, Zheng Zhang, Chen Li, Nanning Zheng, Han Hu
cs.AI

Abstract

This paper presents an improved DETR detector that maintains a "plain" nature: it uses a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that reintroduce multi-scale and locality inductive biases into the decoder. We show that two simple techniques are surprisingly effective within a plain design at compensating for the lack of multi-scale feature maps and locality constraints. The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which guides each query to attend to its corresponding object region while retaining encoding flexibility. The second is masked image modeling (MIM)-based backbone pre-training, which helps learn representations with fine-grained localization ability and proves crucial for removing the dependence on multi-scale feature maps. By incorporating these techniques and recent advances in training and problem formulation, the improved "plain" DETR shows significant gains over the original DETR detector. Leveraging the Object365 dataset for pre-training, it achieves 63.9 mAP with a Swin-L backbone, which is highly competitive with state-of-the-art detectors that rely heavily on multi-scale feature maps and region-based feature extraction. Code is available at https://github.com/impiga/Plain-DETR.
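The BoxRPB idea described above can be illustrated with a short sketch: a bias term, computed from the offsets between each query's predicted box and every feature-map pixel, is added to the cross-attention logits before the softmax, so global attention is softly steered toward the corresponding object region. The sketch below uses PyTorch; the class name, the small MLP, and the (left, top, right, bottom) edge-offset parameterization are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BoxRPBCrossAttention(nn.Module):
    """Single-scale cross-attention with a box-to-pixel relative position bias (sketch)."""

    def __init__(self, dim: int, num_heads: int = 8, hidden: int = 256):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Small MLP mapping box-to-pixel offsets to a per-head bias (assumed form).
        self.rpb_mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, num_heads)
        )

    def forward(self, queries, boxes, feat, pixel_xy):
        # queries:  (B, Q, C)  decoder object queries
        # boxes:    (B, Q, 4)  predicted boxes as normalized (x1, y1, x2, y2)
        # feat:     (B, N, C)  flattened single-scale feature map (keys/values)
        # pixel_xy: (N, 2)     normalized (x, y) centers of the feature-map pixels
        B, Q, C = queries.shape
        N = feat.shape[1]

        q = self.q_proj(queries).view(B, Q, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(feat).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(feat).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Offsets from every pixel to the four box edges: (B, Q, N, 4).
        px = pixel_xy[None, None, :, 0]
        py = pixel_xy[None, None, :, 1]
        offsets = torch.stack(
            [
                px - boxes[..., 0:1],  # distance to left edge
                py - boxes[..., 1:2],  # distance to top edge
                boxes[..., 2:3] - px,  # distance to right edge
                boxes[..., 3:4] - py,  # distance to bottom edge
            ],
            dim=-1,
        )
        bias = self.rpb_mlp(offsets).permute(0, 3, 1, 2)  # (B, heads, Q, N)

        # Global cross-attention over the whole feature map, plus the BoxRPB term.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim**0.5 + bias
        attn = F.softmax(attn, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Q, C)
        return self.out_proj(out)
```

With no multi-scale features, deformable sampling, or windowing in the decoder, this learned bias is the only locality cue: it encourages each query to focus on its predicted box while still permitting fully global attention.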