Scaling Open-Vocabulary Object Detection
June 16, 2023
Authors: Matthias Minderer, Alexey Gritsenko, Neil Houlsby
cs.AI
Abstract
Open-vocabulary object detection has benefited greatly from pretrained
vision-language models, but is still limited by the amount of available
detection training data. While detection training data can be expanded by using
Web image-text pairs as weak supervision, this has not been done at scales
comparable to image-level pretraining. Here, we scale up detection data with
self-training, which uses an existing detector to generate pseudo-box
annotations on image-text pairs. Major challenges in scaling self-training are
the choice of label space, pseudo-annotation filtering, and training
efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which
address these challenges. OWLv2 surpasses the performance of previous
state-of-the-art open-vocabulary detectors already at comparable training
scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples,
yielding further large improvements: With an L/14 architecture, OWL-ST improves
AP on LVIS rare classes, for which the model has seen no human box annotations,
from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale
training for open-world localization, similar to what has been seen for image
classification and language modelling.
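
The self-training step described in the abstract (an existing detector generates pseudo-box annotations on image-text pairs, which are then filtered) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `PseudoBox` type, the `pseudo_annotate` helper, the use of caption words as the label space, the threshold of 0.3, and the `toy_detector` stand-in with its fixed scores.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class PseudoBox:
    """One machine-generated box annotation (hypothetical structure)."""
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized
    label: str                              # text query the box was matched to
    score: float                            # detector confidence in [0, 1]


# Hypothetical detector interface: (image, text queries) -> candidate boxes.
Detector = Callable[[object, Sequence[str]], List[PseudoBox]]


def pseudo_annotate(image: object, caption: str, detector: Detector,
                    score_threshold: float = 0.3) -> List[PseudoBox]:
    """Generate pseudo-boxes for one image-text pair and filter them.

    Assumption: the label space is simply the set of lowercase caption
    words, and filtering is a single confidence threshold.
    """
    queries = sorted(set(caption.lower().split()))
    candidates = detector(image, queries)
    return [b for b in candidates if b.score >= score_threshold]


def toy_detector(image: object, queries: Sequence[str]) -> List[PseudoBox]:
    """Stand-in for a trained detector; returns fixed, made-up scores."""
    scores = {"cat": 0.9, "sofa": 0.4, "a": 0.05}
    return [PseudoBox((0.1, 0.1, 0.5, 0.5), q, scores.get(q, 0.0)) for q in queries]


kept = pseudo_annotate(image=None, caption="A cat on a sofa",
                       detector=toy_detector, score_threshold=0.3)
print([(b.label, b.score) for b in kept])  # prints [('cat', 0.9), ('sofa', 0.4)]
```

In this sketch, the kept pseudo-boxes would then serve as training targets for a new detector. Raising `score_threshold` trades annotation quantity for precision, which is one way to read the filtering challenge the abstract names.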