ChatPaper.ai


Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

May 29, 2025
作者: Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan
cs.AI

Abstract
Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
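The abstract contrasts sequential autoregressive decoding with the fast, parallel generation a discrete diffusion model allows. The paper does not spell out its decoding loop here, so the following is only a minimal illustrative sketch of the general idea behind parallel masked-token decoding (in the style popularized by MaskGIT): start from a fully masked sequence, predict every position at once, commit the most confident predictions, and re-run on the rest. The `toy_denoiser` is a hypothetical stand-in for a trained transformer, not Muddit's actual model.

```python
import numpy as np

MASK = -1  # sentinel id for masked positions

def toy_denoiser(tokens, vocab_size, rng):
    """Hypothetical stand-in for a trained discrete diffusion
    transformer: returns random per-position logits."""
    return rng.normal(size=(len(tokens), vocab_size))

def parallel_decode(seq_len=16, vocab_size=100, steps=4, seed=0):
    """Iterative parallel decoding sketch: each step predicts all
    positions simultaneously, keeps the most confident masked
    positions, and leaves the rest masked for the next step."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK, dtype=int)
    for step in range(steps):
        logits = toy_denoiser(tokens, vocab_size, rng)
        # softmax over the vocabulary at every position
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        pred = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        # unmask a growing fraction of positions each step
        n_keep = int(seq_len * (step + 1) / steps)
        masked = tokens == MASK
        n_commit = n_keep - (~masked).sum()
        if n_commit > 0:
            # commit the most confident still-masked predictions
            conf_masked = np.where(masked, conf, -np.inf)
            idx = np.argsort(-conf_masked)[:n_commit]
            tokens[idx] = pred[idx]
    return tokens
```

Because every position is predicted in each forward pass, the number of model calls is a fixed `steps` rather than growing with sequence length, which is the efficiency argument the abstract makes against token-by-token autoregressive decoding.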

