Scaling Open-Vocabulary Object Detection

Abstract

Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. The major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and the OWL-ST self-training recipe, which address these challenges. OWLv2 already surpasses the performance of previous state-of-the-art open-vocabulary detectors at comparable training scales (~10M examples). With OWL-ST, however, we can scale to over 1B examples, yielding further large improvements: with an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
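The self-training idea described above can be illustrated with a minimal Python sketch. It is not the OWL-ST implementation: the n-gram label space, the `min_score` threshold of 0.3, and the names `PseudoBox`, `candidate_queries`, `pseudo_annotate`, `build_self_training_set`, and `toy_detector` are all assumptions introduced here for illustration. The sketch only shows the three steps the abstract names: choosing a label space from the caption, running an existing detector to get pseudo-boxes, and filtering them by confidence.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class PseudoBox:
    box: Box
    label: str
    score: float


def candidate_queries(caption: str, max_n: int = 2) -> List[str]:
    """Assumed label space: every word n-gram of the caption, up to max_n words."""
    words = caption.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


def pseudo_annotate(image, caption: str, detector: Callable,
                    min_score: float = 0.3) -> List[PseudoBox]:
    """Run an existing detector on one image-text pair and keep confident boxes."""
    predictions = detector(image, candidate_queries(caption))
    return [p for p in predictions if p.score >= min_score]


def build_self_training_set(pairs, detector: Callable, min_score: float = 0.3):
    """Keep only the images that retain at least one pseudo-box after filtering."""
    dataset = []
    for image, caption in pairs:
        kept = pseudo_annotate(image, caption, detector, min_score)
        if kept:
            dataset.append((image, kept))
    return dataset


def toy_detector(image, queries: List[str]) -> List[PseudoBox]:
    """Hypothetical stand-in for a real detector: only 'cat' gets a high score."""
    return [
        PseudoBox((0.0, 0.0, 1.0, 1.0), q, 0.9 if q == "cat" else 0.1)
        for q in queries
    ]


pairs = [("img1", "A cat"), ("img2", "a dog")]
data = build_self_training_set(pairs, toy_detector)
# "img1" keeps one pseudo-box labelled "cat"; "img2" is dropped entirely.
```

In a real pipeline the toy detector would be replaced by a trained open-vocabulary detector, and the filtered pairs would be used as training data for the next model.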
