Real-World Image Variation by Aligning Diffusion Inversion Chain

Abstract

Recent advances in diffusion models have enabled high-fidelity image generation from text prompts. However, a domain gap exists between generated and real-world images, which makes it difficult to generate high-quality variations of real-world images. Our investigation shows that this domain gap originates from a gap between the latent distributions of different diffusion processes. To address this issue, we propose Real-world Image Variation by ALignment (RIVAL), a novel inference pipeline that uses diffusion models to generate image variations from a single image exemplar. The pipeline improves the quality of image variations by aligning the image generation process with the source image's inversion chain. Specifically, we demonstrate that step-wise latent distribution alignment is essential for generating high-quality variations. To achieve this, we design a cross-image self-attention injection for feature interaction and a step-wise distribution normalization to align the latent features. Incorporating these alignment processes into a diffusion model allows RIVAL to generate high-quality image variations without further parameter optimization. Experimental results demonstrate that our approach outperforms existing methods in semantic-condition similarity and perceptual quality. Furthermore, this generalized inference pipeline can be easily applied to other diffusion-based generation tasks, such as image-conditioned text-to-image generation and example-based image inpainting.
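
The two alignment components named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not give the formulas, so the sketch assumes that step-wise distribution normalization matches the per-channel mean and standard deviation of the generated latent to the inversion-chain latent at the same step, and that cross-image self-attention lets queries from the generated image attend to keys and values from both images. The function names, array shapes, and projection matrices are hypothetical.

```python
import numpy as np


def align_latent_stats(gen_latent, src_latent, eps=1e-6):
    """Assumed form of step-wise distribution normalization.

    Shifts and scales gen_latent so that its per-channel mean and standard
    deviation match those of src_latent (the inversion-chain latent at the
    same denoising step). Both arrays have shape (channels, height, width).
    """
    axes = (1, 2)
    gen_mean = gen_latent.mean(axis=axes, keepdims=True)
    gen_std = gen_latent.std(axis=axes, keepdims=True)
    src_mean = src_latent.mean(axis=axes, keepdims=True)
    src_std = src_latent.std(axis=axes, keepdims=True)
    normalized = (gen_latent - gen_mean) / (gen_std + eps)
    return normalized * src_std + src_mean


def cross_image_self_attention(gen_feats, src_feats, w_q, w_k, w_v):
    """Assumed form of cross-image self-attention injection.

    Queries come from the generated image's tokens; keys and values come
    from the source and generated tokens concatenated along the token axis.
    gen_feats: (n_gen, d), src_feats: (n_src, d), w_q/w_k/w_v: (d, d).
    Returns the attended features and the attention weights.
    """
    q = gen_feats @ w_q
    kv_input = np.concatenate([src_feats, gen_feats], axis=0)
    k = kv_input @ w_k
    v = kv_input @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row-wise maximum for a numerically stable softmax.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

In a full pipeline under these assumptions, both functions would be applied at every denoising step, pairing the generation latent with the latent stored at the same step of the source image's inversion chain.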
