
Multi-Modal Classifiers for Open-Vocabulary Object Detection

June 8, 2023
Authors: Prannay Kaul, Weidi Xie, Andrew Zisserman
cs.AI

Abstract

The goal of this paper is open-vocabulary object detection (OVOD): building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture, and explore three ways of specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. Evaluating on the challenging LVIS open-vocabulary benchmark, we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as the text-based classifiers used in prior work; (iii) multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector.
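To make the three contributions concrete, below is a minimal, hypothetical sketch of how such classifiers could be assembled from pre-computed embeddings. It assumes CLIP-style embedding vectors are already available as tensors; mean pooling stands in for the paper's learned visual aggregator, and the convex-combination fusion weight `alpha` is an illustrative assumption, not the authors' method.

```python
# Sketch (not the authors' code): build text-based, vision-based, and
# multi-modal per-class classifier vectors, then score detector regions.
import torch
import torch.nn.functional as F

def text_classifier(description_embeddings: torch.Tensor) -> torch.Tensor:
    """Average L2-normalized embeddings of several LLM-generated
    descriptions of one class into a single classifier vector.
    Input shape: (num_descriptions, dim)."""
    e = F.normalize(description_embeddings, dim=-1)
    return F.normalize(e.mean(dim=0), dim=-1)  # (dim,)

def vision_classifier(exemplar_embeddings: torch.Tensor) -> torch.Tensor:
    """Aggregate any number of image-exemplar embeddings into one
    classifier vector. Mean pooling is a stand-in for the paper's
    learned visual aggregator. Input shape: (num_exemplars, dim)."""
    e = F.normalize(exemplar_embeddings, dim=-1)
    return F.normalize(e.mean(dim=0), dim=-1)  # (dim,)

def multimodal_classifier(text_vec: torch.Tensor,
                          vision_vec: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
    """Fuse the two modalities by a convex combination (an assumed,
    simple fusion rule), then renormalize."""
    fused = alpha * text_vec + (1.0 - alpha) * vision_vec
    return F.normalize(fused, dim=-1)

def classify_regions(region_feats: torch.Tensor,
                     class_vectors: torch.Tensor,
                     temperature: float = 0.01) -> torch.Tensor:
    """Score detector region features against per-class classifier
    vectors with temperature-scaled cosine similarity.
    region_feats: (num_regions, dim); class_vectors: (num_classes, dim)."""
    r = F.normalize(region_feats, dim=-1)
    logits = r @ class_vectors.t() / temperature
    return logits.softmax(dim=-1)  # (num_regions, num_classes)
```

Because each per-class classifier is just a unit vector in the shared embedding space, novel categories can be added at inference time by computing one more vector, with no detector retraining, which is the core OVOD property the abstract describes.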