Designing a Better Asymmetric VQGAN for StableDiffusion

Abstract

StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns a diffusion model in the latent space via a VQGAN, ensuring both efficiency and quality. It supports not only image generation but also the editing of real images, such as image inpainting and local editing. However, we have observed that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To address this, we propose a new asymmetric VQGAN with two simple designs. First, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Second, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost. Training our asymmetric VQGAN is inexpensive: we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it can significantly improve inpainting and editing performance while maintaining the original text-to-image capability. The code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN.
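To make the motivation concrete, here is a minimal toy sketch in plain Python. It is not the paper's architecture: the learned conditional branch is replaced by a hard pixel copy, the lossy autoencoder is replaced by coarse quantization, and the names `lossy_roundtrip` and `composite_with_prior` are hypothetical. It only illustrates why giving the decoding step access to the unmasked region can remove distortion in pixels that were never edited.

```python
# Toy illustration only: a 1-D "image" of integer pixel values.
# lossy_roundtrip and composite_with_prior are hypothetical names,
# not part of the Asymmetric_VQGAN code base.

def lossy_roundtrip(pixels, step=32):
    """Stand-in for a lossy encode/decode pass: coarse quantization."""
    return [(p // step) * step for p in pixels]


def composite_with_prior(decoded, original, mask):
    """Keep decoded values where mask == 1 (edited), original values elsewhere."""
    return [d if m == 1 else o for d, o, m in zip(decoded, original, mask)]


original = [10, 75, 130, 200, 45, 250]
mask = [0, 0, 1, 1, 0, 0]  # 1 marks the region being inpainted

decoded = lossy_roundtrip(original)
restored = composite_with_prior(decoded, original, mask)

# Count unmasked pixels that differ from the input in each case.
plain_errors = sum(
    1 for d, o, m in zip(decoded, original, mask) if m == 0 and d != o
)
prior_errors = sum(
    1 for r, o, m in zip(restored, original, mask) if m == 0 and r != o
)
print(plain_errors, prior_errors)
```

In this toy setting every unmasked pixel is altered by the plain round trip, while the version that sees the unmasked region leaves those pixels untouched. The abstract describes a decoder that takes in such prior information through a conditional branch, which is a different mechanism from the pixel copy shown here.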
