GenMask: Adapting DiT for Segmentation via Direct Masks

Abstract

Recent segmentation approaches use pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task performed through indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation. In this paper, we argue that segmentation should instead be trained directly in a generative manner. We identify a key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, noise-robust, and linearly separable, which makes them distinct from natural image latents. To bridge this gap, we introduce a timestep sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise levels for image generation, enabling harmonious joint training. We present GenMask, a DiT trained under the original generative objective to generate black-and-white segmentation masks as well as color images in RGB space. GenMask preserves the original DiT architecture while removing the need for feature extraction pipelines tailored to segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks, and ablations quantify the contribution of each component.
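The task-dependent timestep sampling described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual distributions, so everything below is an assumption made for illustration: a logit-normal distribution stands in for "moderate noise" on image samples, a U-shaped Beta(0.5, 0.5) distribution stands in for "extreme noise levels" on mask samples, and the function names are hypothetical rather than taken from the GenMask code.

```python
import math
import random


def sample_image_timestep(rng: random.Random) -> float:
    """Assumed image sampler: logit-normal, mass concentrated near t = 0.5."""
    z = rng.gauss(0.0, 1.0)
    return 1.0 / (1.0 + math.exp(-z))


def sample_mask_timestep(rng: random.Random) -> float:
    """Assumed mask sampler: Beta(0.5, 0.5), U-shaped, mass near t = 0 and t = 1."""
    return rng.betavariate(0.5, 0.5)


def sample_timestep(is_mask: bool, rng: random.Random) -> float:
    """Pick the sampler by task so both sample types can share one training batch."""
    if is_mask:
        return sample_mask_timestep(rng)
    return sample_image_timestep(rng)


def fraction_extreme(samples, margin: float = 0.1) -> float:
    """Share of timesteps that fall within `margin` of either end of [0, 1]."""
    hits = sum(1 for t in samples if t < margin or t > 1.0 - margin)
    return hits / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    mask_ts = [sample_timestep(True, rng) for _ in range(10000)]
    image_ts = [sample_timestep(False, rng) for _ in range(10000)]
    print("mask  extreme fraction:", fraction_extreme(mask_ts))
    print("image extreme fraction:", fraction_extreme(image_ts))
```

Under these assumed distributions, mask samples land near the ends of the noise schedule far more often than image samples do, which is the qualitative behavior the abstract describes. Whether the paper emphasizes both ends or only the high-noise end is not stated in the abstract.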