Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling

Abstract

Diffusion models have emerged as a powerful tool for generating high-quality images from textual descriptions. Despite their successes, these models often produce samples with limited diversity, particularly when sampling with a high classifier-free guidance weight. To address this issue, we present Kaleido, a novel approach that enhances sample diversity by incorporating autoregressive latent priors. Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables, which serve as abstract, intermediary representations that guide and facilitate the image generation process. In this paper, we explore a variety of discrete latent representations, including textual descriptions, detection bounding boxes, object blobs, and visual tokens. These representations diversify and enrich the input conditions to the diffusion model, enabling more diverse outputs. Our experimental results demonstrate that Kaleido effectively broadens the diversity of the image samples generated from a given textual description while maintaining high image quality. Furthermore, we show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its capability to control and direct the image generation process.
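The two mechanisms named in the abstract, classifier-free guidance and a latent sampled before image generation, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the list of candidate details, and the guidance convention (a weight of 1 recovers the purely conditional prediction) are all assumptions made for illustration, and the autoregressive prior is replaced by a random choice from a fixed list.

```python
import random


def cfg_combine(eps_uncond, eps_cond, weight):
    """Classifier-free guidance on two noise predictions.

    Assumed convention: weight 0 gives the unconditional prediction,
    weight 1 the conditional one, and larger weights extrapolate further
    toward the condition.
    """
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def sample_latent(caption, rng):
    """Hypothetical stand-in for the autoregressive prior p(z | caption).

    A real model would decode discrete tokens one at a time (text, boxes,
    blobs, or visual tokens); here we pick one detail from a toy list.
    """
    details = ["at sunset", "in the snow", "seen from above", "in close-up"]
    return f"{caption}, {rng.choice(details)}"


def build_condition(caption, rng):
    """Two-stage sampling: draw z first, then condition on (caption, z).

    The returned pair is what a diffusion model would receive as its
    conditioning input; the diffusion step itself is omitted.
    """
    z = sample_latent(caption, rng)
    return caption, z


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(build_condition("a photo of a dog", rng))
```

Because the latent is resampled on every call, the same caption yields different conditioning inputs, which is the intuition behind the diversity gain described above. The guidance weight then acts on a condition that already varies between samples.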

