
Visual In-Context Prompting

November 22, 2023
Authors: Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe Xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao
cs.AI

Abstract

In-context prompting in large language models (LLMs) has become a prevalent approach to improving zero-shot capabilities, but this idea is less explored in the vision domain. Existing visual prompting methods focus on referring segmentation, i.e., segmenting the most relevant object, and fall short of addressing many generic vision tasks such as open-set segmentation and detection. In this paper, we introduce a universal visual in-context prompting framework for both kinds of tasks. In particular, we build on top of an encoder-decoder architecture and develop a versatile prompt encoder that supports a variety of prompts, such as strokes, boxes, and points. We further enhance it to take an arbitrary number of reference image segments as context. Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities, yielding competitive performance on close-set in-domain datasets and showing promising results on many open-set segmentation datasets. By jointly training on COCO and SA-1B, our model achieves 57.7 PQ on COCO and 23.2 PQ on ADE20K. Code will be available at https://github.com/UX-Decoder/DINOv.
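To make the abstract's core idea concrete, the sketch below shows one plausible shape for a "versatile prompt encoder": heterogeneous visual prompts (points, boxes, stroke masks) are each projected into a shared query-embedding space, tagged with a learned prompt-type embedding, and concatenated so a DETR-style decoder could consume them. All class names, shapes, and projections here are illustrative assumptions, not the DINOv implementation.

```python
import torch
import torch.nn as nn


class VisualPromptEncoder(nn.Module):
    """Minimal sketch of a multi-type visual prompt encoder.

    Maps points (x, y), boxes (x1, y1, x2, y2), and binary stroke
    masks into a common embedding space. Hypothetical design, not
    the actual DINOv architecture.
    """

    def __init__(self, embed_dim: int = 256, mask_size: int = 64):
        super().__init__()
        # One lightweight projection per prompt type.
        self.point_proj = nn.Linear(2, embed_dim)
        self.box_proj = nn.Linear(4, embed_dim)
        self.stroke_proj = nn.Linear(mask_size * mask_size, embed_dim)
        # Learned embeddings that tell the decoder which kind of
        # prompt each query came from: 0=point, 1=box, 2=stroke.
        self.type_embed = nn.Embedding(3, embed_dim)

    def forward(self, points=None, boxes=None, strokes=None):
        queries = []
        if points is not None:  # (N, 2) normalized coordinates
            queries.append(self.point_proj(points) + self.type_embed.weight[0])
        if boxes is not None:  # (M, 4) normalized corner coordinates
            queries.append(self.box_proj(boxes) + self.type_embed.weight[1])
        if strokes is not None:  # (K, H, W) binary stroke masks
            flat = strokes.flatten(1).float()
            queries.append(self.stroke_proj(flat) + self.type_embed.weight[2])
        # Concatenate all prompt queries: (N + M + K, embed_dim).
        return torch.cat(queries, dim=0)


encoder = VisualPromptEncoder()
queries = encoder(
    points=torch.rand(2, 2),
    boxes=torch.rand(1, 4),
    strokes=torch.zeros(1, 64, 64),
)
print(queries.shape)  # torch.Size([4, 256])
```

Keeping every prompt type in one query space is what lets a single decoder handle both referring segmentation (one prompt, one target) and generic segmentation (reference segments as in-context examples) without task-specific heads.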