Visual In-Context Prompting

Abstract

In-context prompting in large language models (LLMs) has become a prevalent approach to improving zero-shot capabilities, but the idea is less explored in the vision domain. Existing visual prompting methods focus on referring segmentation, which segments the most relevant object, and fall short of addressing many generic vision tasks such as open-set segmentation and detection. In this paper, we introduce a universal visual in-context prompting framework for both tasks. In particular, we build on an encoder-decoder architecture and develop a versatile prompt encoder that supports a variety of prompts such as strokes, boxes, and points. We further enhance it to take an arbitrary number of reference image segments as context. Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities for referring and detection, yielding competitive performance on closed-set in-domain datasets and promising results on many open-set segmentation datasets. By joint training on COCO and SA-1B, our model achieves 57.7 PQ on COCO and 23.2 PQ on ADE20K. Code will be available at https://github.com/UX-Decoder/DINOv.
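For readers unfamiliar with the PQ figures quoted in the abstract, the sketch below computes panoptic quality for a single class using the standard definition: the sum of IoUs over matched segment pairs divided by TP + FP/2 + FN/2, where a predicted and a ground-truth segment count as matched when their IoU exceeds 0.5. It is an illustration of the metric only, not code from the paper's repository; the function name and the example numbers are hypothetical. Benchmark PQ is usually reported as the average over classes, multiplied by 100.

```python
from typing import Sequence


def panoptic_quality(
    matched_ious: Sequence[float],
    num_false_positives: int,
    num_false_negatives: int,
) -> float:
    """Panoptic quality for one class.

    matched_ious: IoU of each matched (prediction, ground truth) pair,
        i.e. the true positives; each value is assumed to be > 0.5.
    num_false_positives: predicted segments with no match.
    num_false_negatives: ground-truth segments with no match.
    """
    num_true_positives = len(matched_ious)
    denominator = (
        num_true_positives
        + 0.5 * num_false_positives
        + 0.5 * num_false_negatives
    )
    if denominator == 0:
        # No segments of this class in predictions or ground truth.
        return 0.0
    return sum(matched_ious) / denominator


# Hypothetical example: two matched segments, one spurious prediction,
# one missed ground-truth segment.
pq = panoptic_quality([0.8, 0.6], num_false_positives=1, num_false_negatives=1)
print(f"PQ = {100 * pq:.1f}")
```

The metric penalizes both poor mask overlap (through the IoU sum) and unmatched segments (through the denominator), which is why a single number can summarize segmentation and recognition quality together.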
