MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

Abstract

Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
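As an illustration of the multi-timestep fusion idea, the following is a minimal sketch rather than the MMDiff implementation. It assumes features from T denoising timesteps have already been extracted as an array of shape (T, C, H, W), and that each pixel has one aggregation logit per timestep. The function name `fuse_timestep_features`, the softmax normalization, and the random inputs are all hypothetical choices made for this example; in a trained system the logits would presumably be predicted by a learned module.

```python
import numpy as np


def fuse_timestep_features(features, logits):
    """Fuse per-timestep feature maps with spatially varying weights.

    features: array of shape (T, C, H, W), one feature map per timestep.
    logits:   array of shape (T, H, W), one aggregation logit per
              timestep and pixel (assumed here; learned in practice).
    Returns (fused, weights) with shapes (C, H, W) and (T, H, W).
    """
    features = np.asarray(features, dtype=np.float64)
    logits = np.asarray(logits, dtype=np.float64)
    expected = (features.shape[0],) + features.shape[2:]
    if features.ndim != 4 or logits.shape != expected:
        raise ValueError("expected features (T, C, H, W) and logits (T, H, W)")

    # Softmax over the timestep axis, computed independently per pixel.
    shifted = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)

    # Broadcast the weights over channels and sum across timesteps.
    fused = (weights[:, None, :, :] * features).sum(axis=0)
    return fused, weights


# Hypothetical sizes and random stand-in data for demonstration only.
rng = np.random.default_rng(0)
T, C, H, W = 4, 8, 16, 16
features = rng.standard_normal((T, C, H, W))
logits = rng.standard_normal((T, H, W))
fused, weights = fuse_timestep_features(features, logits)
```

With all logits equal, the weights become uniform and the fusion reduces to a plain average over timesteps, which makes a convenient sanity check. Single-timestep extraction corresponds to the opposite extreme, where one timestep receives all the weight at every pixel.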