UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
September 22, 2025
Authors: Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen
cs.AI
Abstract
Recent advances in Large Multi-modal Models (LMMs) have demonstrated their
remarkable success as general-purpose multi-modal assistants, with a particular
focus on holistic image- and video-language understanding. However, less
attention has been given to scaling fine-grained pixel-level understanding
capabilities, where the models are expected to realize pixel-level alignment
between visual signals and language semantics. Some previous studies have
applied LMMs to related tasks such as region-level captioning and referring
expression segmentation. However, these models are limited to performing either
referring or segmentation tasks independently and fail to integrate these
fine-grained perception capabilities into visual reasoning. To bridge this gap,
we propose UniPixel, a large multi-modal model capable of flexibly
comprehending visual prompt inputs and generating mask-grounded responses. Our
model distinguishes itself by seamlessly integrating pixel-level perception
with general visual understanding capabilities. Specifically, UniPixel
processes visual prompts, generates relevant masks on demand, and performs
subsequent reasoning conditioned on these intermediate pointers during
inference, thereby enabling fine-grained pixel-level reasoning. The
effectiveness of our approach has been verified on 10 benchmarks across a
diverse set of tasks, including pixel-level referring/segmentation and
object-centric understanding in images/videos. A novel PixelQA task that
jointly requires referring, segmentation, and question answering is also
designed to verify the flexibility of our method.
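
To make the described inference flow concrete, below is a minimal, hypothetical sketch (in Python) of how a visual prompt could be turned into an object mask that then conditions question answering. All names (VisualPrompt, MaskPointer, segment, answer) and the stub implementations are illustrative assumptions, not the authors' actual API or model.

```python
# A minimal, hypothetical sketch of the "visual prompt -> mask -> grounded answer"
# flow described in the abstract. All names and stub implementations are
# illustrative placeholders, not the authors' actual API or model.
from dataclasses import dataclass
from typing import List


@dataclass
class VisualPrompt:
    kind: str            # e.g. "point", "box", or "mask"
    coords: List[float]  # prompt coordinates in pixel space


@dataclass
class MaskPointer:
    object_id: int
    mask: List[List[int]]  # binary segmentation mask (placeholder representation)


def segment(image, prompts: List[VisualPrompt]) -> List[MaskPointer]:
    """Stage 1 (stub): produce object masks on demand from the visual prompts."""
    return [MaskPointer(object_id=i, mask=[[0]]) for i, _ in enumerate(prompts)]


def answer(image, question: str, pointers: List[MaskPointer]) -> str:
    """Stage 2 (stub): answer the question conditioned on the mask pointers."""
    ids = ", ".join(str(p.object_id) for p in pointers)
    return f"Answer grounded in objects [{ids}] for: {question!r}"


if __name__ == "__main__":
    image = None  # stand-in for an image or video tensor
    prompts = [VisualPrompt(kind="point", coords=[120.0, 88.0])]
    pointers = segment(image, prompts)  # masks act as intermediate object pointers
    print(answer(image, "What is the person on the left holding?", pointers))
```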