Visual In-Context Prompting
November 22, 2023
Authors: Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao
cs.AI
Abstract
In-context prompting in large language models (LLMs) has become a prevalent
approach to improve zero-shot capabilities, but this idea is less explored in
the vision domain. Existing visual prompting methods focus on referring
segmentation to segment the most relevant object, falling short of addressing
many generic vision tasks like open-set segmentation and detection. In this
paper, we introduce a universal visual in-context prompting framework for both
tasks. In particular, we build on top of an encoder-decoder architecture, and
develop a versatile prompt encoder to support a variety of prompts like
strokes, boxes, and points. We further enhance it to take an arbitrary number
of reference image segments as the context. Our extensive explorations show
that the proposed visual in-context prompting elicits extraordinary referring
and generic segmentation and detection capabilities, yielding performance
competitive with closed-set in-domain datasets and promising results on
many open-set segmentation datasets. By jointly training on COCO and SA-1B, our
model achieves 57.7 PQ on COCO and 23.2 PQ on ADE20K. Code will be
available at https://github.com/UX-Decoder/DINOv.
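
The abstract describes an encoder-decoder segmenter conditioned by a versatile prompt encoder that accepts strokes, boxes, points, and reference image segments. The sketch below is a minimal, hypothetical PyTorch illustration of that interface; every class name, projection layer, and tensor shape is an assumption made for exposition, not the actual DINOv implementation (see the linked repository for the real code).

# Minimal sketch (assumed names/shapes): heterogeneous visual prompts are mapped
# into a shared token space that could then condition a segmentation decoder.
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Encodes points, boxes, and reference segments into prompt tokens."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.point_proj = nn.Linear(2, dim)   # (x, y) for points and sampled stroke locations
        self.box_proj = nn.Linear(4, dim)     # (x1, y1, x2, y2) for boxes
        self.mask_proj = nn.Linear(dim, dim)  # projects pooled reference-segment features

    def forward(self, image_feats, points=None, boxes=None, ref_masks=None):
        # image_feats: (C, H, W) backbone features; C == dim in this sketch.
        tokens = []
        if points is not None:                # (Np, 2) normalized coordinates
            tokens.append(self.point_proj(points))
        if boxes is not None:                 # (Nb, 4) normalized coordinates
            tokens.append(self.box_proj(boxes))
        if ref_masks is not None:             # (Nk, H, W) binary reference segments
            # Average backbone features over each segment (simple mask pooling).
            area = ref_masks.flatten(1).sum(dim=1, keepdim=True).clamp(min=1.0)  # (Nk, 1)
            pooled = torch.einsum("chw,khw->kc", image_feats, ref_masks) / area  # (Nk, C)
            tokens.append(self.mask_proj(pooled))
        return torch.cat(tokens, dim=0)       # (num_prompt_tokens, dim)

# Usage: the resulting prompt tokens would be fed to the decoder (e.g. as extra
# queries), so one model can serve both referring and generic segmentation.
enc = PromptEncoder(dim=256)
feats = torch.randn(256, 32, 32)
tokens = enc(
    feats,
    points=torch.rand(3, 2),
    boxes=torch.rand(2, 4),
    ref_masks=(torch.rand(1, 32, 32) > 0.5).float(),
)
print(tokens.shape)  # torch.Size([6, 256])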