ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
October 23, 2025
Authors: Xiaolong Wang, Lixiang Ru, Ziyuan Huang, Kaixiang Ji, Dandan Zheng, Jingdong Chen, Jun Zhou
cs.AI
Abstract
We propose a novel AutoRegressive Generation-based paradigm for image
Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level
perception within a unified framework. Prior works integrating image
segmentation into multimodal large language models (MLLMs) typically employ
either boundary-point representations or dedicated segmentation heads. These
methods rely on discrete representations or semantic prompts fed into
task-specific decoders, which limits the ability of the MLLM to capture
fine-grained visual details. To address these challenges, we introduce a
segmentation framework for MLLM based on image generation, which naturally
produces dense masks for target objects. We leverage the MLLM to output visual
tokens and detokenize them into images using a universal VQ-VAE, making the
segmentation fully dependent on the pixel-level understanding of the MLLM. To
reduce inference latency, we employ a next-scale-prediction strategy to
generate the required visual tokens in parallel. Extensive experiments demonstrate
that our method surpasses prior state-of-the-art approaches on multiple
segmentation datasets with a remarkable boost in inference speed, while
maintaining strong understanding capabilities.
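The decoding pipeline the abstract describes can be illustrated with a minimal, hypothetical sketch: the MLLM emits all visual tokens of a scale in one parallel step (next-scale prediction, coarse to fine), and a VQ-VAE decoder detokenizes the final indices into a dense mask. The module names (`TinyMLLM`, `VQVAEDecoder`), shapes, and scale schedule below are placeholder assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class TinyMLLM(nn.Module):
    """Stand-in for the MLLM: given the running feature map, predicts one
    codebook index per spatial position of the current scale, in parallel."""
    def __init__(self, dim=64, vocab=512):
        super().__init__()
        self.head = nn.Conv2d(dim, vocab, kernel_size=1)

    def forward(self, feat):                      # feat: (B, dim, h, w)
        return self.head(feat).argmax(dim=1)      # (B, h, w) token indices

class VQVAEDecoder(nn.Module):
    """Stand-in for the universal VQ-VAE that detokenizes indices into a mask."""
    def __init__(self, dim=64, vocab=512):
        super().__init__()
        self.codebook = nn.Embedding(vocab, dim)
        self.to_mask = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, ids):                       # ids: (B, h, w)
        feat = self.codebook(ids).permute(0, 3, 1, 2)
        return torch.sigmoid(self.to_mask(feat))  # (B, 1, h, w) soft mask

def next_scale_decode(mllm, vqvae, scales=(1, 2, 4, 8), dim=64, batch=1):
    """Coarse-to-fine decoding: all tokens of a scale are emitted in one
    parallel step, instead of one token at a time in raster order."""
    feat = torch.zeros(batch, dim, scales[0], scales[0])
    ids = None
    for s in scales:
        feat = nn.functional.interpolate(feat, size=(s, s), mode="nearest")
        ids = mllm(feat)                          # all s*s tokens at once
        # Fold the new scale's embeddings back in to condition the next scale.
        feat = feat + vqvae.codebook(ids).permute(0, 3, 1, 2)
    return vqvae(ids)                             # detokenize final-scale ids

mask = next_scale_decode(TinyMLLM(), VQVAEDecoder())
print(mask.shape)  # torch.Size([1, 1, 8, 8])
```

In this toy setup the latency benefit is visible directly: decoding an 8x8 token grid takes only len(scales) = 4 forward passes rather than 64 sequential next-token steps, which is the speedup mechanism the abstract attributes to next-scale prediction.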