Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
May 31, 2024
Authors: Jiatao Gu, Ying Shen, Shuangfei Zhai, Yizhe Zhang, Navdeep Jaitly, Joshua M. Susskind
cs.AI
Abstract
Diffusion models have emerged as a powerful tool for generating high-quality
images from textual descriptions. Despite their successes, these models often
exhibit limited diversity in the sampled images, particularly when sampling
with a high classifier-free guidance weight. To address this issue, we present
Kaleido, a novel approach that enhances the diversity of samples by
incorporating autoregressive latent priors. Kaleido integrates an
autoregressive language model that encodes the original caption and generates
latent variables, serving as abstract and intermediary representations for
guiding and facilitating the image generation process. In this paper, we
explore a variety of discrete latent representations, including textual
descriptions, detection bounding boxes, object blobs, and visual tokens. These
representations diversify and enrich the input conditions to the diffusion
models, enabling more diverse outputs. Our experimental results demonstrate
that Kaleido effectively broadens the diversity of the generated image samples
from a given textual description while maintaining high image quality.
Furthermore, we show that Kaleido adheres closely to the guidance provided by
the generated latent variables, demonstrating its capability to effectively
control and direct the image generation process.
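For context, the diversity loss the abstract attributes to "a high classifier-free guidance weight" comes from the standard classifier-free guidance update, in which the unconditional noise prediction is extrapolated toward the conditional one. Below is a minimal sketch of that update in PyTorch; the function and argument names are illustrative, but the combination rule itself is the standard one.

```python
import torch


def cfg_noise(eps_uncond: torch.Tensor,
              eps_cond: torch.Tensor,
              w: float) -> torch.Tensor:
    """Classifier-free guidance combination of noise predictions.

    w = 1 recovers the purely conditional model; larger w pushes
    samples harder toward the conditional mode, improving prompt
    adherence but collapsing sample diversity -- the failure mode
    Kaleido is designed to counteract.
    """
    return eps_uncond + w * (eps_cond - eps_uncond)
```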
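The two-stage design the abstract describes can be pictured as follows. This is a minimal sketch under assumed interfaces: `ar_model.sample` and `diffusion.sample` are hypothetical stand-ins for the autoregressive latent prior and the conditional diffusion sampler, not the paper's actual API.

```python
def kaleido_sample(caption, ar_model, diffusion, num_images=4):
    """Hypothetical two-stage Kaleido-style sampler (illustrative only)."""
    images = []
    for _ in range(num_images):
        # Stage 1: autoregressively sample a discrete latent from the
        # caption -- e.g., a detailed textual description, detection
        # bounding boxes, object blobs, or visual tokens. Each draw is
        # a distinct high-level "plan" for the image.
        latents = ar_model.sample(caption)
        # Stage 2: run the diffusion sampler conditioned on both the
        # caption and the sampled latent. Diversity now comes from the
        # autoregressive prior, so a high guidance weight no longer
        # collapses the outputs onto a single mode.
        images.append(diffusion.sample(caption=caption, latents=latents))
    return images
```

The design choice this sketch illustrates is that per-sample variation is moved out of the diffusion noise (which strong guidance suppresses) and into an explicit, controllable discrete latent, which is why the generated images are reported to follow the sampled latents closely.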