Visual In-Context Prompting

Abstract

In-context prompting in large language models (LLMs) has become a prevalent approach to improving zero-shot capabilities, but the idea is less explored in the vision domain. Existing visual prompting methods focus on referring segmentation, which segments the most relevant object, and fall short on many generic vision tasks such as open-set segmentation and detection. In this paper, we introduce a universal visual in-context prompting framework for both tasks. In particular, we build on an encoder-decoder architecture and develop a versatile prompt encoder that supports a variety of prompts such as strokes, boxes, and points. We further enhance it to take an arbitrary number of reference image segments as context. Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities for both referring and detection, yielding competitive performance on closed-set in-domain datasets and promising results on many open-set segmentation datasets. By joint training on COCO and SA-1B, our model achieves 57.7 PQ on COCO and 23.2 PQ on ADE20K. Code will be available at https://github.com/UX-Decoder/DINOv.
