Designing a Better Asymmetric VQGAN for StableDiffusion
June 7, 2023
Authors: Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua
cs.AI
Abstract
StableDiffusion is a revolutionary text-to-image generator that is causing a
stir in the world of image generation and editing. Unlike traditional methods
that learn a diffusion model in pixel space, StableDiffusion learns a diffusion
model in the latent space via a VQGAN, ensuring both efficiency and quality. It
not only supports image generation tasks, but also enables image editing for
real images, such as image inpainting and local editing. However, we have
observed that the vanilla VQGAN used in StableDiffusion leads to significant
information loss, causing distortion artifacts even in non-edited image
regions. To this end, we propose a new asymmetric VQGAN with two simple
designs. First, in addition to the input from the encoder, the decoder
contains a conditional branch that incorporates information from task-specific
priors, such as the unmasked image region in inpainting. Second, the decoder
is much heavier than the encoder, allowing for more detailed recovery while
only slightly increasing the total inference cost. Training our asymmetric
VQGAN is cheap: we only need to retrain a new asymmetric decoder while
keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our
asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and
local editing methods. Extensive experiments demonstrate that it can
significantly improve the inpainting and editing performance, while maintaining
the original text-to-image capability. The code is available at
https://github.com/buxiangzhiren/Asymmetric_VQGAN.
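
To make the two decoder-side designs above concrete, here is a minimal PyTorch sketch. It is not the authors' implementation (see the linked repository for that); every module, channel count, and parameter name below is a hypothetical illustration of a decoder that (1) fuses a task-specific prior through a conditional branch and (2) is heavier than the frozen vanilla encoder.

```python
# Minimal sketch of the two decoder-side ideas described in the abstract.
# NOT the authors' code -- all module names, channel counts, and shapes
# here are hypothetical illustrations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalDecoderBlock(nn.Module):
    """Decoder block with a conditional branch that injects a
    task-specific prior (e.g. the unmasked image region in inpainting)."""

    def __init__(self, channels: int):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(8, channels),
            nn.SiLU(),
        )
        # Conditional branch: project the RGB prior to the feature width.
        self.cond = nn.Conv2d(3, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # Resize the prior to the current feature resolution and fuse it.
        prior = F.interpolate(prior, size=x.shape[-2:])
        return self.main(x) + self.cond(prior)


class AsymmetricDecoder(nn.Module):
    """A decoder deliberately heavier than the (frozen) encoder: more
    blocks for detailed recovery. Upsampling stages are omitted for
    brevity; a real VQGAN decoder would upsample back to pixel size."""

    def __init__(self, z_channels: int = 4, base: int = 128, num_blocks: int = 6):
        super().__init__()
        self.stem = nn.Conv2d(z_channels, base, 3, padding=1)
        self.blocks = nn.ModuleList(
            ConditionalDecoderBlock(base) for _ in range(num_blocks)
        )
        self.to_rgb = nn.Conv2d(base, 3, 3, padding=1)

    def forward(self, z: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        x = self.stem(z)
        for blk in self.blocks:
            x = blk(x, prior)
        return self.to_rgb(x)


if __name__ == "__main__":
    z = torch.randn(1, 4, 32, 32)        # latent from the frozen encoder/diffusion
    prior = torch.randn(1, 3, 256, 256)  # task-specific prior, e.g. the masked image
    out = AsymmetricDecoder()(z, prior)
    print(out.shape)  # torch.Size([1, 3, 32, 32]) -- no upsampling in this sketch
```

The design choice mirrored here is that only the decoder is retrained: the vanilla VQGAN encoder and the diffusion model stay frozen, so the latent space the diffusion model was trained on is left untouched while the decoder gains both capacity and access to the task-specific prior.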