Improving Diffusion-Based Image Synthesis with Context Prediction

Abstract

Diffusion models are a new class of generative models that have dramatically advanced image generation, delivering unprecedented quality and diversity. Existing diffusion models mainly try to reconstruct the input image from a corrupted one using a pixel-wise or feature-wise constraint along the spatial axes. However, such point-based reconstruction may fail to make each predicted pixel/feature fully preserve its neighborhood context, which impairs diffusion-based image synthesis. As a powerful source of automatic supervisory signal, context has been well studied for representation learning. Inspired by this, we propose ConPreDiff, which for the first time improves diffusion-based image synthesis with context prediction. During training, we explicitly reinforce each point to predict its neighborhood context (i.e., multi-stride features/tokens/pixels) with a context decoder placed at the end of the diffusion denoising blocks, and we remove the decoder at inference. In this way, each point can better reconstruct itself by preserving its semantic connections with its neighborhood context. This new paradigm generalizes to arbitrary discrete and continuous diffusion backbones without introducing extra parameters in the sampling procedure. We conduct extensive experiments on unconditional image generation, text-to-image generation, and image inpainting tasks. ConPreDiff consistently outperforms previous methods and achieves new state-of-the-art text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
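To make the idea of a training-only context decoder concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the decoder architecture or the loss, so several things here are assumptions chosen for illustration: the function names (`neighbor_offsets`, `context_prediction_loss`), the 8-neighborhood at each stride, the per-offset linear decoder, the mean-squared-error objective, and the loss weight of 0.5.

```python
import numpy as np

def neighbor_offsets(stride):
    """Return the 8 (dy, dx) offsets around a point at the given stride."""
    return [(dy * stride, dx * stride)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0)]

def context_prediction_loss(features, target, decoder_weights, strides=(1, 2)):
    """Toy auxiliary loss: each point predicts its neighbors' target values.

    features: (H, W, C) output of a denoising network (hypothetical).
    target:   (H, W, C) clean features/pixels to be reconstructed.
    decoder_weights: dict mapping (dy, dx) -> (C, C) matrix; a stand-in
        linear "context decoder" (assumption) that exists only in training.
    """
    H, W, _ = features.shape
    total, count = 0.0, 0
    for s in strides:
        for dy, dx in neighbor_offsets(s):
            # Keep only points whose neighbor at (dy, dx) lies inside the map.
            y0, y1 = max(0, -dy), min(H, H - dy)
            x0, x1 = max(0, -dx), min(W, W - dx)
            if y0 >= y1 or x0 >= x1:
                continue  # stride larger than the map in this direction
            src = features[y0:y1, x0:x1]
            nbr = target[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
            pred = src @ decoder_weights[(dy, dx)]  # decode neighbor from point
            total += np.mean((pred - nbr) ** 2)
            count += 1
    return total / max(count, 1)

# Illustrative training-time usage with random data.
rng = np.random.default_rng(0)
H, W, C = 8, 8, 4
target = rng.normal(size=(H, W, C))
features = target + 0.1 * rng.normal(size=(H, W, C))
weights = {off: np.eye(C) for s in (1, 2) for off in neighbor_offsets(s)}

recon = np.mean((features - target) ** 2)        # usual point-wise term
ctx = context_prediction_loss(features, target, weights, strides=(1, 2))
train_loss = recon + 0.5 * ctx                   # 0.5 is an arbitrary weight
```

At sampling time only the denoising network would be evaluated; `decoder_weights` and the context term are simply not used, which is consistent with the abstract's statement that no extra parameters are introduced in the sampling procedure.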
