Scaling Open-Vocabulary Object Detection

Abstract

Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but it is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. The major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and the OWL-ST self-training recipe, which address these challenges. OWLv2 already surpasses the performance of previous state-of-the-art open-vocabulary detectors at comparable training scales (~10M examples). With OWL-ST, however, we can scale to over 1B examples, yielding further large improvements: with an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (a 43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
