GenMask: Adapting DiT for Segmentation via Direct Mask
March 25, 2026
Authors: Yuhuan Yang, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai, Jiangchao Yao, Ya Zhang, Junyang Lin, Yanfeng Wang
cs.AI
Abstract
Recent approaches to segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation. In this paper, we argue that segmentation tasks should instead be trained directly in a generative manner. We identify a key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, noise-robust, and linearly separable, which makes them distinct from natural image latents. To bridge this gap, we introduce a timestep sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise levels for image generation, enabling harmonious joint training. We present GenMask, a DiT trained to generate black-and-white segmentation masks as well as colorful images in RGB space under the original generative objective. GenMask preserves the original DiT architecture while removing the need for feature extraction pipelines tailored to segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks, and ablations quantify the contribution of each component.
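The task-dependent timestep sampling described in the abstract can be illustrated with a minimal sketch. The abstract does not state the actual distributions, so everything below is an assumption made for illustration only: the function name `sample_timestep`, the convention that t = 1 corresponds to pure noise, the choice of Beta(5, 1) to stand in for "extreme noise levels", and the choice of a logit-normal draw to stand in for "moderate noise". It is not the authors' implementation.

```python
import math
import random


def sample_timestep(task, rng):
    """Draw a noise level t in [0, 1]; t near 1 is assumed to mean almost pure noise.

    Both distributions are illustrative assumptions, not taken from the paper.
    """
    if task == "segmentation":
        # Assumed: Beta(5, 1) places most of its mass near t = 1 (extreme noise).
        return rng.betavariate(5.0, 1.0)
    if task == "generation":
        # Assumed: logit-normal, i.e. the sigmoid of a standard normal draw,
        # which concentrates samples around t = 0.5 (moderate noise).
        return 1.0 / (1.0 + math.exp(-rng.gauss(0.0, 1.0)))
    raise ValueError(f"unknown task: {task!r}")


# Compare the average noise level each task would see during joint training.
rng = random.Random(0)
seg = [sample_timestep("segmentation", rng) for _ in range(10_000)]
gen = [sample_timestep("generation", rng) for _ in range(10_000)]
seg_mean = sum(seg) / len(seg)
gen_mean = sum(gen) / len(gen)
print(f"mean t, segmentation: {seg_mean:.2f}")
print(f"mean t, generation:   {gen_mean:.2f}")
```

In a joint training loop, each sample in a batch would draw its timestep according to its own task, so mask samples and image samples share one model and one objective while seeing different noise regimes.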