MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

Abstract

Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
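
To make the idea of multi-timestep fusion with spatially varying aggregation weights concrete, the following is a minimal sketch and not the authors' implementation. It assumes that features from T denoising timesteps share one spatial resolution, and that each pixel has one logit per timestep which is normalized with a softmax over timesteps; the abstract does not say how the weights are actually computed. The names `softmax` and `fuse_timesteps` are hypothetical, and plain Python lists stand in for tensors so the example has no dependencies.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_timesteps(features, logits):
    """Fuse per-timestep feature maps with per-pixel weights.

    features[t][y][x] is a list of C floats (one feature vector per pixel).
    logits[t][y][x] is a float; in practice it would come from a learned
    head (an assumption of this sketch, not stated in the abstract).
    Returns fused[y][x], a list of C floats.
    """
    num_steps = len(features)
    height = len(features[0])
    width = len(features[0][0])
    channels = len(features[0][0][0])

    fused = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Weights differ from pixel to pixel: each location decides
            # which timesteps it trusts most.
            weights = softmax([logits[t][y][x] for t in range(num_steps)])
            for t in range(num_steps):
                for c in range(channels):
                    fused[y][x][c] += weights[t] * features[t][y][x][c]
    return fused


# Two timesteps, a 1x2 feature map, one channel.
features = [
    [[[1.0], [1.0]]],  # timestep 0
    [[[3.0], [3.0]]],  # timestep 1
]
logits = [
    [[0.0, 100.0]],   # timestep 0: strongly preferred at the second pixel
    [[0.0, -100.0]],  # timestep 1
]
fused = fuse_timesteps(features, logits)
```

In this toy input the first pixel averages both timesteps equally, while the second pixel takes almost all of its value from timestep 0. A single-timestep extractor corresponds to the special case where every pixel puts all of its weight on the same timestep.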