Designing a Better Asymmetric VQGAN for StableDiffusion
June 7, 2023
Authors: Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua
cs.AI
Abstract
StableDiffusion is a revolutionary text-to-image generator that is causing a
stir in the world of image generation and editing. Unlike traditional methods
that learn a diffusion model in pixel space, StableDiffusion learns a diffusion
model in the latent space via a VQGAN, ensuring both efficiency and quality. It
not only supports image generation tasks, but also enables image editing for
real images, such as image inpainting and local editing. However, we have
observed that the vanilla VQGAN used in StableDiffusion leads to significant
information loss, causing distortion artifacts even in non-edited image
regions. To this end, we propose a new asymmetric VQGAN with two simple
designs. First, in addition to the input from the encoder, the decoder
contains a conditional branch that incorporates information from task-specific
priors, such as the unmasked image region in inpainting. Second, the decoder
is much heavier than the encoder, allowing for more detailed recovery while
only slightly increasing the total inference cost. Training our asymmetric
VQGAN is cheap: we only need to retrain a new asymmetric decoder while
keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our
asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and
local editing methods. Extensive experiments demonstrate that it can
significantly improve the inpainting and editing performance, while maintaining
the original text-to-image capability. The code is available at
https://github.com/buxiangzhiren/Asymmetric_VQGAN.
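
To make the two decoder-side designs above concrete, here is a minimal PyTorch sketch. It is not the authors' implementation (see the linked repository for that); every module, channel count, and parameter name below is a hypothetical illustration of a decoder that (1) fuses a task-specific prior through a conditional branch and (2) is heavier than the frozen vanilla encoder.

```python
# Minimal sketch of the two decoder-side ideas described in the abstract.
# NOT the authors' code -- all module names, channel counts, and shapes
# here are hypothetical illustrations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalDecoderBlock(nn.Module):
    """Decoder block with a conditional branch that injects a
    task-specific prior (e.g. the unmasked image region in inpainting)."""

    def __init__(self, channels: int):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(8, channels),
            nn.SiLU(),
        )
        # Conditional branch: project the RGB prior to the feature width.
        self.cond = nn.Conv2d(3, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # Resize the prior to the current feature resolution and fuse it.
        prior = F.interpolate(prior, size=x.shape[-2:])
        return self.main(x) + self.cond(prior)


class AsymmetricDecoder(nn.Module):
    """A decoder deliberately heavier than the (frozen) encoder: more
    blocks for detailed recovery. Upsampling stages are omitted for
    brevity; a real VQGAN decoder would upsample back to pixel size."""

    def __init__(self, z_channels: int = 4, base: int = 128, num_blocks: int = 6):
        super().__init__()
        self.stem = nn.Conv2d(z_channels, base, 3, padding=1)
        self.blocks = nn.ModuleList(
            ConditionalDecoderBlock(base) for _ in range(num_blocks)
        )
        self.to_rgb = nn.Conv2d(base, 3, 3, padding=1)

    def forward(self, z: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        x = self.stem(z)
        for blk in self.blocks:
            x = blk(x, prior)
        return self.to_rgb(x)


if __name__ == "__main__":
    z = torch.randn(1, 4, 32, 32)        # latent from the frozen encoder/diffusion
    prior = torch.randn(1, 3, 256, 256)  # task-specific prior, e.g. the masked image
    out = AsymmetricDecoder()(z, prior)
    print(out.shape)  # torch.Size([1, 3, 32, 32]) -- no upsampling in this sketch
```

The design choice mirrored here is that only the decoder is retrained: the vanilla VQGAN encoder and the diffusion model stay frozen, so the latent space the diffusion model was trained on is left untouched while the decoder gains both capacity and access to the task-specific prior.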