Multi-Modal Classifiers for Open-Vocabulary Object Detection

June 8, 2023
Authors: Prannay Kaul, Weidi Xie, Andrew Zisserman
cs.AI

Abstract

The goal of this paper is open-vocabulary object detection (OVOD): building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture, and explore three ways of specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. When evaluating on the challenging LVIS open-vocabulary benchmark, we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as the text-based classifiers of prior work; (iii) multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector.
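
To make the three classifier types concrete, below is a minimal PyTorch sketch. It assumes class descriptions and image exemplars have already been embedded into a shared vision-language space (e.g. by CLIP); the mean-pooling aggregator and the convex-combination fusion weight `alpha` are illustrative simplifications, not the paper's implementation, which prompts an LLM for description sets and learns a visual aggregator over exemplars.

```python
import torch
import torch.nn.functional as F

def text_classifier(desc_embeds: torch.Tensor) -> torch.Tensor:
    """Build a class embedding from LLM-generated description embeddings
    (e.g. CLIP text embeddings), shape (num_descriptions, dim)."""
    w = F.normalize(desc_embeds, dim=-1).mean(dim=0)
    return F.normalize(w, dim=-1)

def vision_classifier(exemplar_embeds: torch.Tensor) -> torch.Tensor:
    """Mean-pool embeddings of any number of image exemplars, shape
    (num_exemplars, dim). A stand-in for the paper's learned aggregator."""
    w = F.normalize(exemplar_embeds, dim=-1).mean(dim=0)
    return F.normalize(w, dim=-1)

def multimodal_classifier(w_text: torch.Tensor, w_vision: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
    """Fuse the two unimodal classifiers; `alpha` is an assumed
    hyper-parameter, not a value taken from the paper."""
    return F.normalize(alpha * w_text + (1.0 - alpha) * w_vision, dim=-1)

# Toy usage with random embeddings standing in for real encoder outputs.
dim = 512
w_t = text_classifier(torch.randn(10, dim))    # 10 LLM descriptions
w_v = vision_classifier(torch.randn(5, dim))   # 5 image exemplars
w_mm = multimodal_classifier(w_t, w_v)

# A two-stage detector would score each region proposal by cosine
# similarity between its embedding and the class embedding.
region_feats = F.normalize(torch.randn(100, dim), dim=-1)
scores = region_feats @ w_mm                   # (100,) per-proposal scores
```

Because the classifier is just an embedding vector, swapping between text-based, vision-based, and multi-modal classification requires no detector retraining: only the class embeddings change at inference time.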