GenMask: Adapting DiT for Segmentation via Direct Mask Prediction

Abstract

Recent segmentation approaches use pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task performed through indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation. In this paper, we argue that segmentation should instead be trained directly in a generative manner. We identify a key obstacle to this unified formulation: unlike natural image latents, VAE latents of binary masks are sharply distributed, noise robust, and linearly separable. To bridge this gap, we introduce a timestep sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise levels for image generation, enabling harmonious joint training. We present GenMask, a DiT trained to generate both black-and-white segmentation masks and colorful images in RGB space under the original generative objective. GenMask preserves the original DiT architecture while removing the need for feature extraction pipelines tailored to segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks, and ablations quantify the contribution of each component.
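The task-dependent timestep sampling described in the abstract can be illustrated with a minimal sketch. The abstract does not state which distributions are used, nor whether "extreme" means both ends of the noise range or only one, so everything here is an assumption made for illustration: a logit-normal sampler stands in for "moderate noise" on image batches, a U-shaped symmetric Beta sampler stands in for "extreme noise" on mask batches, and the function names and parameter values (`mean`, `std`, `alpha`, `margin`) are hypothetical.

```python
import math
import random


def sample_image_timestep(rng, mean=0.0, std=1.0):
    """Logit-normal sample in (0, 1); mass concentrates at moderate noise levels.

    Assumption: the choice of logit-normal is illustrative, not from the paper.
    """
    return 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))


def sample_mask_timestep(rng, alpha=0.3):
    """Symmetric Beta(alpha, alpha) sample; alpha < 1 gives a U shape near 0 and 1.

    Assumption: the U-shaped Beta is one possible reading of "extreme noise levels".
    """
    return rng.betavariate(alpha, alpha)


def sample_timestep(rng, is_mask):
    """Pick the sampler according to whether the training target is a binary mask."""
    return sample_mask_timestep(rng) if is_mask else sample_image_timestep(rng)


def extreme_fraction(samples, margin=0.1):
    """Fraction of timesteps within `margin` of either end of [0, 1]."""
    hits = sum(1 for t in samples if t < margin or t > 1.0 - margin)
    return hits / len(samples)


rng = random.Random(0)
image_ts = [sample_timestep(rng, is_mask=False) for _ in range(10_000)]
mask_ts = [sample_timestep(rng, is_mask=True) for _ in range(10_000)]

print("image batches, share of extreme timesteps:", extreme_fraction(image_ts))
print("mask batches, share of extreme timesteps:", extreme_fraction(mask_ts))
```

In a joint training loop, each batch would draw its timesteps from the sampler matching its target type, while the generative objective and the DiT itself stay unchanged, which is the property the abstract emphasizes.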
