MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

Abstract

Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
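
To make the central idea concrete, the following is a minimal sketch of multi-timestep feature fusion with spatially varying aggregation weights. It is an illustration under stated assumptions, not the MMDiff implementation: the function name `fuse_timestep_features`, the tensor layout `(T, C, H, W)`, and the use of a per-pixel softmax over timesteps are all hypothetical choices made here for clarity. In practice the weight logits would presumably be predicted by a small learned module; here they are simply passed in.

```python
import numpy as np


def fuse_timestep_features(features, weight_logits):
    """Fuse features from T denoising timesteps with per-pixel weights.

    Hypothetical layout (an assumption, not taken from the paper):
      features:      (T, C, H, W) frozen-backbone features, one map per timestep
      weight_logits: (T, H, W)    unnormalized per-pixel score for each timestep
    Returns the fused map (C, H, W) and the normalized weights (T, H, W).
    """
    features = np.asarray(features, dtype=np.float64)
    weight_logits = np.asarray(weight_logits, dtype=np.float64)

    if features.ndim != 4 or weight_logits.ndim != 3:
        raise ValueError("expected features (T, C, H, W) and logits (T, H, W)")
    if (features.shape[0] != weight_logits.shape[0]
            or features.shape[2:] != weight_logits.shape[1:]):
        raise ValueError("timestep count and spatial size must match")

    # Numerically stable softmax over the timestep axis, computed
    # independently at every spatial location.
    shifted = weight_logits - weight_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)

    # Broadcast the weights over the channel axis and sum over timesteps.
    fused = (weights[:, None, :, :] * features).sum(axis=0)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 8, 16, 16))   # 4 timesteps, 8 channels
    logits = rng.normal(size=(4, 16, 16))     # stand-in for learned scores
    fused, weights = fuse_timestep_features(feats, logits)
    print(fused.shape, weights.shape)
```

Because the weights are normalized per pixel, each spatial location can draw on a different mix of timesteps, whereas single-timestep extraction corresponds to the special case in which one timestep receives all the weight everywhere.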