Visual In-Context Prompting
November 22, 2023
Authors: Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao
cs.AI
Abstract
In-context prompting in large language models (LLMs) has become a prevalent
approach to improve zero-shot capabilities, but this idea is less explored in
the vision domain. Existing visual prompting methods focus on referring
segmentation to segment the most relevant object, falling short of addressing
many generic vision tasks like open-set segmentation and detection. In this
paper, we introduce a universal visual in-context prompting framework for both
tasks. In particular, we build on top of an encoder-decoder architecture, and
develop a versatile prompt encoder to support a variety of prompts like
strokes, boxes, and points. We further enhance it to take an arbitrary number
of reference image segments as the context. Our extensive explorations show
that the proposed visual in-context prompting elicits extraordinary referring
and generic segmentation capabilities for both referring and detection, yielding
performance competitive with closed-set in-domain datasets and showing promising results on
many open-set segmentation datasets. By joint training on COCO and SA-1B, our
model achieves 57.7 PQ on COCO and 23.2 PQ on ADE20K. Code will be
available at https://github.com/UX-Decoder/DINOv.
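The abstract describes a versatile prompt encoder that maps heterogeneous visual prompts (strokes, boxes, points) and an arbitrary number of reference image segments into tokens consumed by an encoder-decoder segmentation model. The sketch below illustrates this idea at a high level in PyTorch; it is a minimal, hypothetical illustration under assumed module names and token shapes, not the DINOv implementation (see the repository above for the authors' code).

```python
# Minimal, hypothetical sketch of a versatile visual prompt encoder: boxes and points
# are embedded from normalized coordinates, while strokes and reference segments are
# embedded from binary masks. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class VisualPromptEncoder(nn.Module):
    def __init__(self, hidden_dim: int = 256, mask_patch: int = 16):
        super().__init__()
        self.box_proj = nn.Linear(4, hidden_dim)    # (x1, y1, x2, y2) in [0, 1]
        self.point_proj = nn.Linear(2, hidden_dim)  # (x, y) in [0, 1]
        # Strokes / reference segments arrive as binary masks; patchify and embed.
        self.mask_embed = nn.Sequential(
            nn.Conv2d(1, hidden_dim // 4, kernel_size=mask_patch, stride=mask_patch),
            nn.ReLU(),
            nn.Conv2d(hidden_dim // 4, hidden_dim, kernel_size=1),
        )
        # Learned type embeddings distinguish the prompt modalities.
        self.type_embed = nn.Embedding(3, hidden_dim)  # 0: box, 1: point, 2: mask

    def _encode_masks(self, masks: torch.Tensor) -> torch.Tensor:
        # masks: (N, 1, H, W) binary maps -> one token per mask via global pooling.
        feat = self.mask_embed(masks)        # (N, C, H/ps, W/ps)
        return feat.flatten(2).mean(dim=-1)  # (N, C)

    def forward(self, boxes=None, points=None, masks=None) -> torch.Tensor:
        """Return a variable-length sequence of prompt tokens, shape (num_prompts, C)."""
        tokens = []
        if boxes is not None:    # (Nb, 4)
            tokens.append(self.box_proj(boxes) + self.type_embed.weight[0])
        if points is not None:   # (Np, 2)
            tokens.append(self.point_proj(points) + self.type_embed.weight[1])
        if masks is not None:    # (Nm, 1, H, W) — strokes or in-context reference segments
            tokens.append(self._encode_masks(masks) + self.type_embed.weight[2])
        return torch.cat(tokens, dim=0)


if __name__ == "__main__":
    enc = VisualPromptEncoder()
    prompt_tokens = enc(
        boxes=torch.rand(2, 4),
        points=torch.rand(3, 2),
        masks=(torch.rand(4, 1, 64, 64) > 0.5).float(),  # e.g. 4 reference segments
    )
    print(prompt_tokens.shape)  # torch.Size([9, 256])
```

In a full pipeline, such prompt tokens would be fed to the decoder alongside image features so that the model can either segment the referred object or detect all instances matching the in-context examples; how the tokens are consumed is specific to the paper's architecture and is not sketched here.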