Scaling Open-Vocabulary Object Detection
June 16, 2023
Authors: Matthias Minderer, Alexey Gritsenko, Neil Houlsby
cs.AI
Abstract
Open-vocabulary object detection has benefited greatly from pretrained
vision-language models, but is still limited by the amount of available
detection training data. While detection training data can be expanded by using
Web image-text pairs as weak supervision, this has not been done at scales
comparable to image-level pretraining. Here, we scale up detection data with
self-training, which uses an existing detector to generate pseudo-box
annotations on image-text pairs. Major challenges in scaling self-training are
the choice of label space, pseudo-annotation filtering, and training
efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which
address these challenges. OWLv2 surpasses the performance of previous
state-of-the-art open-vocabulary detectors already at comparable training
scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples,
yielding further large improvements: with an L/14 architecture, OWL-ST improves
AP on LVIS rare classes, for which the model has seen no human box annotations,
from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale
training for open-world localization, similar to what has been seen for image
classification and language modelling.
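
As a rough illustration of the pseudo-annotation filtering step mentioned above, the following minimal Python sketch keeps only the pseudo-boxes whose detector confidence clears a threshold. It also checks the relative-improvement arithmetic quoted in the abstract. The `PseudoBox` type, the function names, the example boxes, and the threshold of 0.3 are all hypothetical and chosen for illustration; the abstract does not specify how OWL-ST filters its annotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PseudoBox:
    """One machine-generated box annotation (hypothetical structure)."""
    label: str    # text query that produced the box
    score: float  # detector confidence in [0, 1]
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels


def filter_pseudo_boxes(boxes, min_score=0.3):
    """Keep pseudo-boxes with confidence >= min_score (threshold is an assumption)."""
    return [b for b in boxes if b.score >= min_score]


def relative_improvement(before, after):
    """Fractional gain of `after` over `before`."""
    return (after - before) / before


# Made-up detector outputs for one image-text pair.
candidates = [
    PseudoBox("dog", 0.82, (10, 20, 200, 220)),
    PseudoBox("frisbee", 0.12, (150, 40, 190, 70)),
    PseudoBox("grass", 0.45, (0, 180, 320, 240)),
]

kept = filter_pseudo_boxes(candidates)
kept_labels = [b.label for b in kept]

# LVIS rare-class AP figures quoted in the abstract: 31.2% -> 44.6%.
gain_percent = round(relative_improvement(31.2, 44.6) * 100)
```

In this sketch a higher threshold trades annotation quantity for quality. The surviving boxes would then serve as training targets for the next detector, which is the general idea of self-training.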