
Visual In-Context Prompting

November 22, 2023
Authors: Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao
cs.AI

Abstract

In-context prompting in large language models (LLMs) has become a prevalent approach to improving zero-shot capabilities, but this idea is less explored in the vision domain. Existing visual prompting methods focus on referring segmentation, i.e., segmenting the most relevant object, and fall short of addressing many generic vision tasks such as open-set segmentation and detection. In this paper, we introduce a universal visual in-context prompting framework for both kinds of tasks. In particular, we build on top of an encoder-decoder architecture and develop a versatile prompt encoder that supports a variety of prompts, such as strokes, boxes, and points. We further enhance it to take an arbitrary number of reference image segments as context. Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities, yielding competitive performance on close-set in-domain datasets and showing promising results on many open-set segmentation datasets. By jointly training on COCO and SA-1B, our model achieves 57.7 PQ on COCO and 23.2 PQ on ADE20K. Code will be available at https://github.com/UX-Decoder/DINOv.
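To make the abstract's core idea concrete, the sketch below shows one plausible shape for a "versatile prompt encoder": heterogeneous visual prompts (points, boxes, stroke masks) are each projected into a shared query-embedding space, tagged with a learned prompt-type embedding, and concatenated so a DETR-style decoder could consume them. All class names, shapes, and projections here are illustrative assumptions, not the DINOv implementation.

```python
import torch
import torch.nn as nn


class VisualPromptEncoder(nn.Module):
    """Minimal sketch of a multi-type visual prompt encoder.

    Maps points (x, y), boxes (x1, y1, x2, y2), and binary stroke
    masks into a common embedding space. Hypothetical design, not
    the actual DINOv architecture.
    """

    def __init__(self, embed_dim: int = 256, mask_size: int = 64):
        super().__init__()
        # One lightweight projection per prompt type.
        self.point_proj = nn.Linear(2, embed_dim)
        self.box_proj = nn.Linear(4, embed_dim)
        self.stroke_proj = nn.Linear(mask_size * mask_size, embed_dim)
        # Learned embeddings that tell the decoder which kind of
        # prompt each query came from: 0=point, 1=box, 2=stroke.
        self.type_embed = nn.Embedding(3, embed_dim)

    def forward(self, points=None, boxes=None, strokes=None):
        queries = []
        if points is not None:  # (N, 2) normalized coordinates
            queries.append(self.point_proj(points) + self.type_embed.weight[0])
        if boxes is not None:  # (M, 4) normalized corner coordinates
            queries.append(self.box_proj(boxes) + self.type_embed.weight[1])
        if strokes is not None:  # (K, H, W) binary stroke masks
            flat = strokes.flatten(1).float()
            queries.append(self.stroke_proj(flat) + self.type_embed.weight[2])
        # Concatenate all prompt queries: (N + M + K, embed_dim).
        return torch.cat(queries, dim=0)


encoder = VisualPromptEncoder()
queries = encoder(
    points=torch.rand(2, 2),
    boxes=torch.rand(1, 4),
    strokes=torch.zeros(1, 64, 64),
)
print(queries.shape)  # torch.Size([4, 256])
```

Keeping every prompt type in one query space is what lets a single decoder handle both referring segmentation (one prompt, one target) and generic segmentation (reference segments as in-context examples) without task-specific heads.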