Revisiting DETR Pre-training for Object Detection

Abstract

Motivated by the new records that DETR-based approaches have set on the COCO detection and segmentation benchmarks, many recent efforts have explored how to further improve these approaches by pre-training the Transformer in a self-supervised manner while keeping the backbone frozen. Some studies have already claimed significant improvements in accuracy. In this paper, we take a closer look at their experimental methodology and check whether their approaches remain effective on very recent state-of-the-art detectors such as H-Deformable-DETR. We conduct thorough experiments on COCO object detection to study the influence of the choice of pre-training dataset and of the localization and classification target generation schemes. Unfortunately, we find that previous representative self-supervised approaches, such as DETReg, fail to boost the performance of strong DETR-based approaches in the full-data regime. We further analyze the reasons and find that simply combining a more accurate box predictor with the Objects365 benchmark can significantly improve the results in follow-up experiments. We demonstrate the effectiveness of our approach by achieving a strong object detection result of AP = 59.3% on the COCO val set, which surpasses H-Deformable-DETR + Swin-L by +1.4%. Finally, we generate a series of synthetic pre-training datasets by combining a very recent image-to-text captioning model (LLaVA) with a text-to-image generative model (SDXL). Pre-training on these synthetic datasets leads to notable improvements in object detection performance. Looking ahead, we anticipate substantial advantages from future expansion of the synthetic pre-training dataset.
