Diffusion Model as a Generalist Segmentation Learner

Abstract

Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be used for text-conditioned semantic and open-vocabulary segmentation, and that the approach generalizes to various downstream tasks, yielding a general-purpose diffusion segmentation framework. Concretely, we introduce DiGSeg (Diffusion Models as a Generalist Segmentation Learner), which repurposes a pretrained diffusion model into a unified segmentation framework. Our approach encodes the input image and ground-truth mask into the latent space and concatenates them as conditioning signals for the diffusion U-Net. A parallel CLIP-aligned text pathway injects language features at multiple scales, enabling the model to align textual queries with evolving visual representations. This design turns an off-the-shelf diffusion backbone into a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. Extensive experiments demonstrate state-of-the-art performance on standard semantic segmentation benchmarks, as well as strong open-vocabulary generalization and cross-domain transfer to medical, remote sensing, and agricultural scenarios, all without domain-specific architectural customization. These results indicate that modern diffusion backbones can serve as generalist segmentation learners rather than pure generators, narrowing the gap between visual generation and visual understanding.
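The latent concatenation step described above can be illustrated with a minimal NumPy sketch. It is not the DiGSeg implementation: the function name `build_unet_input`, the 4-channel latent shape, the linear beta schedule, and the choice to noise the mask latent with the standard DDPM forward process are all assumptions made for illustration, since the abstract states only that the image and mask latents are concatenated as input to the diffusion U-Net. Random arrays stand in for the encoder outputs.

```python
import numpy as np

def build_unet_input(image_latent, mask_latent, t, alphas_cumprod, rng):
    """Noise the mask latent at step t and concatenate it with the clean image latent.

    Hypothetical helper: the noising scheme is the standard DDPM forward
    process, assumed here because the abstract does not specify it.
    """
    noise = rng.standard_normal(mask_latent.shape)
    a_bar = alphas_cumprod[t]
    # Forward process: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    noisy_mask = np.sqrt(a_bar) * mask_latent + np.sqrt(1.0 - a_bar) * noise
    # Channel-wise concatenation: (B, C, H, W) + (B, C, H, W) -> (B, 2C, H, W)
    unet_input = np.concatenate([image_latent, noisy_mask], axis=1)
    return unet_input, noise

rng = np.random.default_rng(0)
# Stand-ins for the latent encodings of an image batch and its masks (assumed shapes).
image_latent = rng.standard_normal((2, 4, 32, 32))
mask_latent = rng.standard_normal((2, 4, 32, 32))

# Assumed linear beta schedule with 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

unet_input, noise = build_unet_input(image_latent, mask_latent, 500, alphas_cumprod, rng)
print(unet_input.shape)  # (2, 8, 32, 32)
```

In a setup like this, the U-Net's first convolution would need to accept twice the usual number of latent channels, and the text features from the CLIP-aligned pathway would be supplied separately, for example through cross-attention; neither detail is confirmed by the abstract.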
