Revisiting DETR Pre-training for Object Detection
August 2, 2023
Authors: Yan Ma, Weicong Liang, Yiduo Hao, Bohan Chen, Xiangyu Yue, Chao Zhang, Yuhui Yuan
cs.AI
Abstract
Motivated by the fact that DETR-based approaches have established new records on
COCO detection and segmentation benchmarks, many recent endeavors show
increasing interest in how to further improve DETR-based approaches by
pre-training the Transformer in a self-supervised manner while keeping the
backbone frozen. Some studies have already claimed significant improvements in
accuracy. In this paper, we take a closer look at their experimental
methodology and check whether their approaches are still effective on very
recent state-of-the-art methods such as H-Deformable-DETR. We conduct thorough
experiments on COCO object detection tasks to study the influence of the choice
of pre-training datasets and of localization and classification
target-generation schemes. Unfortunately, we find that previous representative
self-supervised approaches such as DETReg fail to boost the performance of
strong DETR-based approaches in the full-data regime. We further analyze the
reasons and find that simply combining a more accurate box predictor with the
Objects365 benchmark can significantly improve the results in follow-up
experiments. We demonstrate the effectiveness of our approach by achieving a
strong object detection result of AP = 59.3% on the COCO val set, surpassing
H-Deformable-DETR + Swin-L by +1.4%.
Finally, we generate a series of synthetic pre-training datasets by combining
very recent image-to-text captioning models (LLaVA) and text-to-image
generative models (SDXL). Notably, pre-training on these synthetic datasets
leads to significant improvements in object detection performance. Looking
ahead, we anticipate substantial advantages from future expansion of the
synthetic pre-training dataset.
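The synthetic-data pipeline mentioned above (caption real images with an image-to-text model, then synthesize new images from those captions with a text-to-image model) can be sketched generically. The function below is an illustrative sketch, not the paper's implementation; the names `build_synthetic_dataset`, `caption_fn`, and `generate_fn` are hypothetical placeholders for wrappers around models such as LLaVA and SDXL.

```python
def build_synthetic_dataset(images, caption_fn, generate_fn):
    """Caption each real image, then synthesize a new image per caption.

    images:      an iterable of source images (any representation).
    caption_fn:  image -> caption string (e.g., a wrapper around an
                 image-to-text model such as LLaVA; hypothetical here).
    generate_fn: caption -> synthetic image (e.g., a wrapper around a
                 text-to-image model such as SDXL; hypothetical here).

    Returns a list of (caption, synthetic_image) pairs that could serve
    as pre-training data.
    """
    dataset = []
    for img in images:
        caption = caption_fn(img)        # image-to-text step
        synthetic = generate_fn(caption) # text-to-image step
        dataset.append((caption, synthetic))
    return dataset
```

In practice the two callables would invoke heavyweight models; structuring the loop around injected functions keeps the sketch model-agnostic and easy to test with stubs.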