Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

May 29, 2025
作者: Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Shuangyong Song, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan
cs.AI

Abstract

Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.
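The speed advantage claimed above comes from parallel decoding: a discrete diffusion model starts from a fully masked token sequence and fills in many positions per refinement step, instead of emitting one token at a time. The sketch below illustrates that loop with a MaskGIT-style confidence schedule; it is a toy illustration, not Muddit's actual architecture, and `toy_denoiser` is a random stand-in for the trained transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 16    # toy discrete token vocabulary
MASK = VOCAB  # extra id used as the [MASK] token
SEQ_LEN = 12
STEPS = 4     # a few parallel refinement steps vs. SEQ_LEN sequential ones


def toy_denoiser(tokens):
    """Stand-in for the learned denoiser: returns per-position logits
    over the vocabulary (random here, for illustration only)."""
    return rng.normal(size=(len(tokens), VOCAB))


def parallel_decode(seq_len=SEQ_LEN, steps=STEPS):
    tokens = np.full(seq_len, MASK)
    for step in range(steps):
        logits = toy_denoiser(tokens)
        probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
        pred = probs.argmax(-1)   # most likely token per position
        conf = probs.max(-1)      # its probability, used as confidence
        masked = tokens == MASK
        # cosine schedule: the fraction left masked shrinks each step
        n_keep = int(np.ceil(seq_len * np.cos((step + 1) / steps * np.pi / 2)))
        # keep the n_keep lowest-confidence positions masked; unmasked
        # positions get confidence +inf so they are never re-masked
        order = np.argsort(np.where(masked, conf, np.inf))
        to_fill = np.ones(seq_len, dtype=bool)
        to_fill[order[:n_keep]] = False
        tokens = np.where(masked & to_fill, pred, tokens)
    # fill any positions still masked after the last scheduled step
    return np.where(tokens == MASK, toy_denoiser(tokens).argmax(-1), tokens)


out = parallel_decode()
```

Each iteration commits the high-confidence predictions in parallel and re-masks only the uncertain ones, so the total number of denoiser calls is `STEPS` rather than `SEQ_LEN`; this is the generic trade-off the abstract refers to, independent of Muddit's specific backbone.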


PDF · May 30, 2025