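Scaling Open-Vocabulary Object Detection

Abstract

Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but it is still limited by the amount of available detection training data. Detection training data can be expanded by using Web image-text pairs as weak supervision, but this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. The major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and the OWL-ST self-training recipe, which address these challenges. At comparable training scales (~10M examples), OWLv2 already surpasses the performance of previous state-of-the-art open-vocabulary detectors. With OWL-ST, however, we can scale to over 1B examples, which yields further large improvements: with an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (a 43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.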
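As a rough illustration of the pseudo-annotation filtering step named above, the following minimal Python sketch keeps only detector predictions above a confidence threshold. The abstract does not specify the filtering rule, so the `PseudoBox` structure, the `filter_pseudo_boxes` helper, and the threshold value of 0.3 are all assumptions made for this example, not the OWL-ST recipe itself. A second helper checks the relative-improvement figure quoted in the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PseudoBox:
    """One detector prediction on a web image (hypothetical structure)."""
    label: str                               # text query matched to the box
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
    score: float                             # detector confidence in [0, 1]


def filter_pseudo_boxes(predictions: List[PseudoBox],
                        min_score: float = 0.3) -> List[PseudoBox]:
    """Keep predictions whose confidence is at least min_score.

    The default threshold is an assumption for illustration only.
    """
    return [p for p in predictions if p.score >= min_score]


def relative_improvement(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    preds = [
        PseudoBox("cat", (0.1, 0.1, 0.5, 0.5), 0.9),
        PseudoBox("dog", (0.2, 0.2, 0.6, 0.6), 0.1),
    ]
    kept = filter_pseudo_boxes(preds)
    print([p.label for p in kept])
    # LVIS rare-class AP figures quoted in the abstract.
    print(round(relative_improvement(31.2, 44.6)))
```

With the abstract's figures, (44.6 - 31.2) / 31.2 is about 0.43, which matches the stated 43% relative improvement.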
