A Noise is Worth Diffusion Guidance

Abstract

Diffusion models excel at generating high-quality images. However, current diffusion models struggle to produce reliable images without guidance methods such as classifier-free guidance (CFG). Are guidance methods truly necessary? Observing that noise obtained via diffusion inversion can reconstruct high-quality images without guidance, we focus on the initial noise of the denoising pipeline. By mapping Gaussian noise to "guidance-free noise", we find that small, low-magnitude, low-frequency components significantly enhance the denoising process, removing the need for guidance and thus improving both inference throughput and memory usage. Building on this, we propose NoiseRefine, a novel method that replaces guidance methods with a single refinement of the initial noise. This refined noise enables high-quality image generation without guidance, within the same diffusion pipeline. Our noise-refining model leverages efficient noise-space learning, achieving rapid convergence and strong performance with just 50K text-image pairs. We validate its effectiveness across diverse metrics and analyze how refined noise can eliminate the need for guidance. See our project page: https://cvlab-kaist.github.io/NoiseRefine/.
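To make the throughput claim concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard CFG combination, in which each denoising step evaluates the network twice (once unconditionally, once with the prompt), and compares it with a guidance-free run that evaluates it once per step. `CountingDenoiser`, `refine_noise`, `sample`, and `count_passes` are hypothetical names introduced only for this illustration. The toy update rule is not a real sampler, and the refiner is an identity placeholder standing in for a learned model.

```python
import random


class CountingDenoiser:
    """Toy stand-in for a diffusion network that counts its forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, t, cond):
        self.calls += 1
        # Arbitrary toy prediction; a real model would be a neural network.
        shift = 0.0 if cond is None else 0.1
        return [0.5 * v + shift for v in x]


def guided_eps(model, x, t, cond, scale):
    """Classifier-free guidance: two forward passes per denoising step."""
    eps_uncond = model(x, t, None)
    eps_cond = model(x, t, cond)
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def refine_noise(noise):
    """Hypothetical placeholder for a learned noise refiner (identity here)."""
    return list(noise)


def sample(model, noise, cond, steps, scale=None):
    """Toy denoising loop; uses CFG only when a guidance scale is given."""
    x = list(noise)
    for t in range(steps, 0, -1):
        if scale is None:
            eps = model(x, t, cond)  # guidance-free: one pass per step
        else:
            eps = guided_eps(model, x, t, cond, scale)
        x = [v - e / steps for v, e in zip(x, eps)]  # illustrative update only
    return x


def count_passes(steps=10, scale=7.5, seed=0):
    """Return (forward passes with CFG, forward passes without guidance)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
    guided, unguided = CountingDenoiser(), CountingDenoiser()
    sample(guided, noise, "a prompt", steps, scale=scale)
    sample(unguided, refine_noise(noise), "a prompt", steps)
    return guided.calls, unguided.calls
```

With 10 steps, `count_passes()` reports 20 network evaluations for the guided run and 10 for the guidance-free run. In the setting the abstract describes, the refiner would add a single extra evaluation before sampling starts. Implementations that batch the conditional and unconditional inputs together trade the second pass for a doubled batch, which is one plausible reading of the memory saving mentioned above.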
