Unified Multimodal Discrete Diffusion
March 26, 2025
Authors: Alexander Swerdlow, Mihir Prabhudesai, Siddharth Gandhi, Deepak Pathak, Katerina Fragkiadaki
cs.AI
Abstract
Multimodal generative models that can understand and generate across multiple
modalities are dominated by autoregressive (AR) approaches, which process
tokens sequentially from left to right, or top to bottom. These models jointly
handle images, text, video, and audio for various tasks such as image
captioning, question answering, and image generation. In this work, we explore
discrete diffusion models as a unified generative formulation in the joint text
and image domain, building upon their recent success in text generation.
Discrete diffusion models offer several advantages over AR models, including
improved control over quality versus diversity of generated samples, the
ability to perform joint multimodal inpainting (across both text and image
domains), and greater controllability in generation through guidance.
Leveraging these benefits, we present the first Unified Multimodal Discrete
Diffusion (UniDisc) model which is capable of jointly understanding and
generating text and images for a variety of downstream tasks. We compare
UniDisc to multimodal AR models, performing a scaling analysis and
demonstrating that UniDisc outperforms them in performance and inference-time
compute, and offers enhanced controllability, editability, inpainting, and a
flexible trade-off between inference time and generation quality. Code and
additional visualizations are available at https://unidisc.github.io.
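To make the sampling idea concrete, below is a minimal, hedged sketch of masked discrete diffusion over a joint text-and-image token sequence with classifier-free guidance, as described at a high level in the abstract. It is not the authors' implementation: the vocabulary size, sequence layout, number of steps, guidance formulation, and the dummy_model stand-in are illustrative assumptions; a real system would replace dummy_model with a trained transformer over UniDisc's actual text and image tokenizers.

```python
# Minimal sketch (not the authors' implementation) of masked discrete diffusion
# sampling over a joint text+image token sequence with classifier-free guidance.
# All sizes, the mask schedule, and the model are illustrative assumptions.
import torch

VOCAB_SIZE = 1024          # assumed joint text+image vocabulary size
MASK_ID = VOCAB_SIZE       # extra [MASK] token id
SEQ_LEN = 32               # assumed joint sequence: text tokens then image tokens
STEPS = 8                  # denoising steps (inference-time vs. quality trade-off)
GUIDANCE_SCALE = 1.5       # classifier-free guidance strength (assumed)

def dummy_model(tokens: torch.Tensor, conditional: bool) -> torch.Tensor:
    """Stand-in denoiser returning logits over the vocabulary at every position.
    A real model would be a transformer conditioned on the unmasked tokens."""
    torch.manual_seed(0 if conditional else 1)
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB_SIZE)

def sample(batch_size: int = 1) -> torch.Tensor:
    # Start from a fully masked sequence; joint inpainting would instead keep
    # known text/image tokens fixed and mask only the regions to be filled in.
    tokens = torch.full((batch_size, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(STEPS):
        # Classifier-free guidance: combine conditional and unconditional logits.
        cond = dummy_model(tokens, conditional=True)
        uncond = dummy_model(tokens, conditional=False)
        logits = uncond + GUIDANCE_SCALE * (cond - uncond)
        probs = logits.softmax(dim=-1)
        confidence, prediction = probs.max(dim=-1)
        # Only positions that are still masked are candidates for unmasking.
        still_masked = tokens == MASK_ID
        confidence = confidence.masked_fill(~still_masked, -1.0)
        # Unmask the most confident share of the remaining masked positions.
        num_to_unmask = max(1, int(still_masked.sum(dim=-1).max()) // (STEPS - step))
        idx = confidence.topk(num_to_unmask, dim=-1).indices
        tokens.scatter_(1, idx, prediction.gather(1, idx))
    return tokens

if __name__ == "__main__":
    print(sample())
```

The confidence-based unmasking loop is one plausible way to realize the trade-off the abstract mentions: fewer steps unmask more tokens per iteration for faster, lower-quality generation, while more steps refine the sequence gradually.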