InstaGen: Enhancing Object Detection by Training on Synthetic Dataset

Abstract

In this paper, we introduce a novel paradigm for enhancing the ability of an object detector, e.g., expanding its categories or improving its detection performance, by training on a synthetic dataset generated from diffusion models. Specifically, we integrate an instance-level grounding head into a pre-trained generative diffusion model, augmenting it with the ability to localise arbitrary instances in the generated images. The grounding head is trained to align the text embeddings of category names with the regional visual features of the diffusion model, using supervision from an off-the-shelf object detector and a novel self-training scheme on (novel) categories not covered by the detector. This enhanced diffusion model, termed InstaGen, can serve as a data synthesizer for object detection. We conduct thorough experiments showing that an object detector can be enhanced by training on the synthetic dataset from InstaGen, demonstrating superior performance over existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse (+1.2 to 5.2 AP) scenarios.

