ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
October 23, 2025
Authors: Xiaolong Wang, Lixiang Ru, Ziyuan Huang, Kaixiang Ji, Dandan Zheng, Jingdong Chen, Jun Zhou
cs.AI
Abstract
We propose a novel AutoRegressive Generation-based paradigm for image
Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level
perception within a unified framework. Prior works integrating image
segmentation into multimodal large language models (MLLMs) typically employ
either boundary-point representations or dedicated segmentation heads. These
methods rely on discrete representations or semantic prompts fed into
task-specific decoders, which limits the ability of the MLLM to capture
fine-grained visual details. To address these challenges, we introduce a
segmentation framework for MLLM based on image generation, which naturally
produces dense masks for target objects. We leverage the MLLM to output visual
tokens and detokenize them into images using a universal VQ-VAE, making the
segmentation fully dependent on the pixel-level understanding of the MLLM. To
reduce inference latency, we employ a next-scale-prediction strategy to
generate the required visual tokens in parallel. Extensive experiments demonstrate
that our method surpasses prior state-of-the-art approaches on multiple
segmentation datasets with a remarkable boost in inference speed, while
maintaining strong understanding capabilities.
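The decoding pipeline the abstract describes can be illustrated with a minimal, hypothetical sketch: the MLLM emits all visual tokens of a scale in one parallel step (next-scale prediction, coarse to fine), and a VQ-VAE decoder detokenizes the final indices into a dense mask. The module names (`TinyMLLM`, `VQVAEDecoder`), shapes, and scale schedule below are placeholder assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class TinyMLLM(nn.Module):
    """Stand-in for the MLLM: given the running feature map, predicts one
    codebook index per spatial position of the current scale, in parallel."""
    def __init__(self, dim=64, vocab=512):
        super().__init__()
        self.head = nn.Conv2d(dim, vocab, kernel_size=1)

    def forward(self, feat):                      # feat: (B, dim, h, w)
        return self.head(feat).argmax(dim=1)      # (B, h, w) token indices

class VQVAEDecoder(nn.Module):
    """Stand-in for the universal VQ-VAE that detokenizes indices into a mask."""
    def __init__(self, dim=64, vocab=512):
        super().__init__()
        self.codebook = nn.Embedding(vocab, dim)
        self.to_mask = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, ids):                       # ids: (B, h, w)
        feat = self.codebook(ids).permute(0, 3, 1, 2)
        return torch.sigmoid(self.to_mask(feat))  # (B, 1, h, w) soft mask

def next_scale_decode(mllm, vqvae, scales=(1, 2, 4, 8), dim=64, batch=1):
    """Coarse-to-fine decoding: all tokens of a scale are emitted in one
    parallel step, instead of one token at a time in raster order."""
    feat = torch.zeros(batch, dim, scales[0], scales[0])
    ids = None
    for s in scales:
        feat = nn.functional.interpolate(feat, size=(s, s), mode="nearest")
        ids = mllm(feat)                          # all s*s tokens at once
        # Fold the new scale's embeddings back in to condition the next scale.
        feat = feat + vqvae.codebook(ids).permute(0, 3, 1, 2)
    return vqvae(ids)                             # detokenize final-scale ids

mask = next_scale_decode(TinyMLLM(), VQVAEDecoder())
print(mask.shape)  # torch.Size([1, 1, 8, 8])
```

In this toy setup the latency benefit is visible directly: decoding an 8x8 token grid takes only len(scales) = 4 forward passes rather than 64 sequential next-token steps, which is the speedup mechanism the abstract attributes to next-scale prediction.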