
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling

May 31, 2024
Authors: Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua M. Susskind
cs.AI

Abstract

Diffusion models have emerged as a powerful tool for generating high-quality images from textual descriptions. Despite their successes, these models often exhibit limited diversity in the sampled images, particularly when sampling with a high classifier-free guidance weight. To address this issue, we present Kaleido, a novel approach that enhances the diversity of samples by incorporating autoregressive latent priors. Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables, serving as abstract and intermediary representations for guiding and facilitating the image generation process. In this paper, we explore a variety of discrete latent representations, including textual descriptions, detection bounding boxes, object blobs, and visual tokens. These representations diversify and enrich the input conditions to the diffusion models, enabling more diverse outputs. Our experimental results demonstrate that Kaleido effectively broadens the diversity of the generated image samples from a given textual description while maintaining high image quality. Furthermore, we show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its capability to effectively control and direct the image generation process.
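The abstract describes a two-stage sampling scheme: an autoregressive model first samples a sequence of discrete latent tokens from the caption, and the diffusion model then conditions on both the caption and those latents, so sample diversity comes from the stochastic latent sampling rather than from lowering the classifier-free guidance weight. The sketch below illustrates that idea only; `LatentPrior`, `Denoiser`, the GRU backbone, and all shapes and hyperparameters are hypothetical stand-ins, not the authors' implementation (the paper uses a pretrained autoregressive language model and a text-to-image diffusion backbone).

```python
import torch
import torch.nn as nn

class LatentPrior(nn.Module):
    """Toy autoregressive prior: conditions on a pooled caption encoding and
    samples discrete latent tokens (e.g., visual tokens or serialized
    boxes/blobs in the paper's taxonomy)."""
    def __init__(self, vocab=1024, dim=256, max_len=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)
        self.max_len = max_len

    @torch.no_grad()
    def sample(self, caption_emb, temperature=1.0):
        # caption_emb: (B, dim); it initializes the recurrent state.
        h = caption_emb.unsqueeze(0)                      # (1, B, dim)
        tok = torch.zeros(caption_emb.size(0), 1, dtype=torch.long)  # BOS
        out = []
        for _ in range(self.max_len):
            x = self.embed(tok[:, -1:])                   # embed last token
            y, h = self.rnn(x, h)
            logits = self.head(y[:, -1]) / temperature
            # Stochastic sampling here is the source of output diversity.
            nxt = torch.multinomial(logits.softmax(-1), 1)
            out.append(nxt)
            tok = torch.cat([tok, nxt], dim=1)
        return torch.cat(out, dim=1)                      # (B, max_len)

class Denoiser(nn.Module):
    """Toy conditional denoiser: conditions on the caption AND the latents."""
    def __init__(self, dim=256, vocab=1024):
        super().__init__()
        self.lat_embed = nn.Embedding(vocab, dim)
        self.net = nn.Sequential(
            nn.Linear(dim * 3, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x_t, caption_emb, latents):
        cond = self.lat_embed(latents).mean(dim=1)        # pool latent tokens
        return self.net(torch.cat([x_t, caption_emb, cond], dim=-1))

# Four samples of one caption get four different latent "plans", so they can
# diverge even under high classifier-free guidance.
prior, denoiser = LatentPrior(), Denoiser()
caption_emb = torch.randn(4, 256)
z = prior.sample(caption_emb)
x = torch.randn(4, 256)
for t in range(50):                                       # schematic reverse process
    x = x - 0.02 * denoiser(x, caption_emb, z)
```

In this sketch the denoiser sees the latents at every step, which mirrors the abstract's claim that generation "adheres closely" to the sampled latent variables: fixing `z` pins down the layout-level content, while resampling `z` varies it.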

