Revisiting DETR Pre-training for Object Detection
August 2, 2023
Authors: Yan Ma, Weicong Liang, Yiduo Hao, Bohan Chen, Xiangyu Yue, Chao Zhang, Yuhui Yuan
cs.AI
Abstract
Motivated by the new records that DETR-based approaches have established on COCO
detection and segmentation benchmarks, many recent efforts show increasing
interest in how to further improve DETR-based approaches by pre-training the
Transformer in a self-supervised manner while keeping the backbone frozen. Some
studies have already claimed significant improvements in accuracy. In this paper, we
take a closer look at their experimental methodology and check whether their
approaches remain effective on very recent state-of-the-art methods such as
H-Deformable-DETR. We conduct thorough experiments on the COCO object
detection task to study the influence of the choice of pre-training dataset and
of the localization and classification target generation schemes. Unfortunately, we
find that previous representative self-supervised approaches, such as DETReg, fail
to boost the performance of strong DETR-based approaches in full-data
regimes. We further analyze the reasons and find that simply combining a more
accurate box predictor with the Objects365 benchmark can significantly improve the
results in follow-up experiments. We demonstrate the effectiveness of our
approach by achieving strong object detection results of AP=59.3% on the COCO
val set, which surpasses H-Deformable-DETR + Swin-L by +1.4%.
Finally, we generate a series of synthetic pre-training datasets by combining
very recent image-to-text captioning models (LLaVA) and text-to-image
generative models (SDXL). Pre-training on these synthetic datasets
leads to notable improvements in object detection performance. Looking ahead,
we anticipate substantial advantages through the future expansion of the
synthetic pre-training dataset.