MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

Abstract

Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that uses lightweight decoder heads to jointly produce images alongside any combination of dense perceptual modalities. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are both competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
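
To make the idea of multi-timestep fusion with spatially varying aggregation weights concrete, the following is a minimal sketch and not the MMDiff implementation. The abstract does not say how the weights are computed, so the sketch assumes one unnormalized score per timestep and spatial location, normalized with a softmax over timesteps. The names `fuse_timesteps`, `features`, and `logits` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_timesteps(features, logits):
    """Fuse per-timestep feature maps with per-location weights.

    features[t][p] is the feature vector at timestep t and location p.
    logits[t][p] is an unnormalized score for timestep t at location p
    (assumed here to come from some small learned module).
    Returns fused[p], a convex combination over timesteps at each location.
    """
    num_steps = len(features)
    num_positions = len(features[0])
    fused = []
    for p in range(num_positions):
        # Each location gets its own distribution over timesteps.
        weights = softmax([logits[t][p] for t in range(num_steps)])
        channels = len(features[0][p])
        fused.append([
            sum(weights[t] * features[t][p][c] for t in range(num_steps))
            for c in range(channels)
        ])
    return fused


# Toy input: 2 timesteps, 2 locations, 2 channels.
features = [
    [[1.0, 0.0], [1.0, 0.0]],  # timestep 0
    [[0.0, 1.0], [0.0, 1.0]],  # timestep 1
]
logits = [
    [0.0, 10.0],   # timestep 0 scores per location
    [0.0, -10.0],  # timestep 1 scores per location
]
fused = fuse_timesteps(features, logits)
```

In this toy input the first location averages both timesteps equally, while the second draws almost entirely on timestep 0. A single global weight per timestep could not express that per-location difference.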