Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
May 31, 2024
Authors: Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua M. Susskind
cs.AI
Abstract
Diffusion models have emerged as a powerful tool for generating high-quality
images from textual descriptions. Despite their successes, these models often
exhibit limited diversity in the sampled images, particularly when sampling
with a high classifier-free guidance weight. To address this issue, we present
Kaleido, a novel approach that enhances the diversity of samples by
incorporating autoregressive latent priors. Kaleido integrates an
autoregressive language model that encodes the original caption and generates
latent variables, serving as abstract and intermediary representations for
guiding and facilitating the image generation process. In this paper, we
explore a variety of discrete latent representations, including textual
descriptions, detection bounding boxes, object blobs, and visual tokens. These
representations diversify and enrich the input conditions to the diffusion
models, enabling more diverse outputs. Our experimental results demonstrate
that Kaleido effectively broadens the diversity of the generated image samples
from a given textual description while maintaining high image quality.
Furthermore, we show that Kaleido adheres closely to the guidance provided by
the generated latent variables, demonstrating its capability to effectively
control and direct the image generation process.
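For context, the diversity loss the abstract attributes to "a high classifier-free guidance weight" comes from the standard classifier-free guidance update, in which the unconditional noise prediction is extrapolated toward the conditional one. Below is a minimal sketch of that update in PyTorch; the function and argument names are illustrative, but the combination rule itself is the standard one.

```python
import torch


def cfg_noise(eps_uncond: torch.Tensor,
              eps_cond: torch.Tensor,
              w: float) -> torch.Tensor:
    """Classifier-free guidance combination of noise predictions.

    w = 1 recovers the purely conditional model; larger w pushes
    samples harder toward the conditional mode, improving prompt
    adherence but collapsing sample diversity -- the failure mode
    Kaleido is designed to counteract.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)
```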
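The two-stage design the abstract describes can be pictured as follows. This is a minimal sketch under assumed interfaces: `ar_model.sample` and `diffusion.sample` are hypothetical stand-ins for the autoregressive latent prior and the conditional diffusion sampler, not the paper's actual API.

```python
def kaleido_sample(caption, ar_model, diffusion, num_images=4):
    """Hypothetical two-stage Kaleido-style sampler (illustrative only)."""
    images = []
    for _ in range(num_images):
        # Stage 1: autoregressively sample a discrete latent from the
        # caption -- e.g., a detailed textual description, detection
        # bounding boxes, object blobs, or visual tokens. Each draw is
        # a distinct high-level "plan" for the image.
        latents = ar_model.sample(caption)
        # Stage 2: run the diffusion sampler conditioned on both the
        # caption and the sampled latent. Diversity now comes from the
        # autoregressive prior, so a high guidance weight no longer
        # collapses the outputs onto a single mode.
        images.append(diffusion.sample(caption=caption, latents=latents))
    return images
```

The design choice this sketch illustrates is that per-sample variation is moved out of the diffusion noise (which strong guidance suppresses) and into an explicit, controllable discrete latent, which is why the generated images are reported to follow the sampled latents closely.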