Revisiting DETR Pre-training for Object Detection

Abstract

Motivated by the new records that DETR-based approaches have set on the COCO detection and segmentation benchmarks, many recent works have shown growing interest in further improving these approaches by pre-training the Transformer in a self-supervised manner while keeping the backbone frozen. Some studies have already claimed significant accuracy improvements. In this paper, we take a closer look at their experimental methodology and check whether their approaches remain effective on very recent state-of-the-art detectors such as H-Deformable-DETR. We conduct thorough experiments on COCO object detection to study the influence of the choice of pre-training dataset and of the localization and classification target generation schemes. Unfortunately, we find that representative previous self-supervised approaches such as DETReg fail to boost the performance of strong DETR-based approaches in the full-data regime. We further analyze the reasons and find that simply combining a more accurate box predictor with the Objects365 benchmark significantly improves the results in follow-up experiments. We demonstrate the effectiveness of our approach by achieving a strong object detection result of AP=59.3% on the COCO val set, which surpasses H-Deformable-DETR + Swin-L by +1.4%. Finally, we generate a series of synthetic pre-training datasets by combining a very recent image-to-text captioning model (LLaVA) with a text-to-image generative model (SDXL). Pre-training on these synthetic datasets leads to notable improvements in object detection performance. Looking ahead, we anticipate substantial gains from further expanding the synthetic pre-training dataset.
