Designing a Better Asymmetric VQGAN for StableDiffusion

June 7, 2023
Authors: Zixin Zhu, Xuelu Feng, Dongdong Chen, Jianmin Bao, Le Wang, Yinpeng Chen, Lu Yuan, Gang Hua
cs.AI

Abstract

StableDiffusion is a revolutionary text-to-image generator that is causing a stir in the world of image generation and editing. Unlike traditional methods that learn a diffusion model in pixel space, StableDiffusion learns the diffusion model in a latent space via a VQGAN, ensuring both efficiency and quality. It supports not only image generation but also editing of real images, such as image inpainting and local editing. However, we observe that the vanilla VQGAN used in StableDiffusion leads to significant information loss, causing distortion artifacts even in non-edited image regions. To this end, we propose a new asymmetric VQGAN with two simple designs. First, in addition to the input from the encoder, the decoder contains a conditional branch that incorporates information from task-specific priors, such as the unmasked image region in inpainting. Second, the decoder is much heavier than the encoder, allowing for more detailed recovery while only slightly increasing the total inference cost. Training our asymmetric VQGAN is cheap: we only need to retrain a new asymmetric decoder while keeping the vanilla VQGAN encoder and StableDiffusion unchanged. Our asymmetric VQGAN can be widely used in StableDiffusion-based inpainting and local editing methods. Extensive experiments demonstrate that it significantly improves inpainting and editing performance while maintaining the original text-to-image capability. The code is available at https://github.com/buxiangzhiren/Asymmetric_VQGAN.
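
To make the two design choices concrete, here is a minimal PyTorch sketch of the idea: a conditional branch encodes the task-specific prior (the masked image plus its mask, in the inpainting case) into multi-scale features, and a decoder fuses those features with the latent at matching resolutions while carrying extra capacity. All module names, channel widths, and the add-based fusion are illustrative assumptions, not the paper's exact architecture; the official implementation is in the linked repository.

```python
# Illustrative sketch only: layer names, widths, and fusion-by-addition are
# assumptions, not the paper's exact architecture (see the official repo).
import torch
import torch.nn as nn


class ConditionalBranch(nn.Module):
    """Encodes the task-specific prior (e.g. masked image + mask, 4 channels)
    into multi-scale feature maps the decoder can fuse in."""

    def __init__(self, in_ch=4, base_ch=64, num_scales=3):
        super().__init__()
        chs = [base_ch * 2 ** i for i in range(num_scales)]  # e.g. [64, 128, 256]
        self.stages = nn.ModuleList()
        prev = in_ch
        for ch in chs:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, ch, 3, stride=2, padding=1),
                nn.SiLU(),
            ))
            prev = ch

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # shallow -> deep; deepest matches the latent resolution


class AsymmetricDecoder(nn.Module):
    """Decoder that is heavier than the encoder and consumes the conditional
    features at matching resolutions. Only this module would be retrained;
    the vanilla VQGAN encoder and the diffusion model stay frozen."""

    def __init__(self, z_ch=4, base_ch=64, num_scales=3):
        super().__init__()
        chs = [base_ch * 2 ** i for i in range(num_scales)]
        self.stem = nn.Conv2d(z_ch, chs[-1], 3, padding=1)
        self.up_blocks = nn.ModuleList()
        for i in range(num_scales - 1, 0, -1):  # deep -> shallow
            self.up_blocks.append(nn.Sequential(
                nn.Upsample(scale_factor=2, mode="nearest"),
                nn.Conv2d(chs[i], chs[i - 1], 3, padding=1),
                nn.SiLU(),
                # extra conv per stage stands in for the "heavier decoder"
                nn.Conv2d(chs[i - 1], chs[i - 1], 3, padding=1),
                nn.SiLU(),
            ))
        self.head = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(chs[0], 3, 3, padding=1),
        )

    def forward(self, z, cond_feats):
        h = self.stem(z) + cond_feats[-1]      # fuse deepest prior at latent res
        for blk, skip in zip(self.up_blocks, reversed(cond_feats[:-1])):
            h = blk(h) + skip                  # fuse at each upsampling stage
        return self.head(h)


# Usage: latent z from the frozen VQGAN encoder, masked image + mask as prior.
cond_branch, decoder = ConditionalBranch(), AsymmetricDecoder()
z = torch.randn(1, 4, 32, 32)          # SD latent for a 256x256 image
prior = torch.randn(1, 4, 256, 256)    # masked image (3 ch) + binary mask (1 ch)
out = decoder(z, cond_branch(prior))   # -> (1, 3, 256, 256)
```

Fusing the prior by addition at each scale is the simplest plausible choice; the point is that the decoder sees the unmasked pixels directly, so they no longer have to survive the lossy round trip through the latent space.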