Real-World Image Variation by Aligning Diffusion Inversion Chain
May 30, 2023
Authors: Yuechen Zhang, Jinbo Xing, Eric Lo, Jiaya Jia
cs.AI
Abstract
Recent diffusion model advancements have enabled high-fidelity images to be
generated using text prompts. However, a domain gap exists between generated
images and real-world images, which poses a challenge in generating
high-quality variations of real-world images. Our investigation uncovers that
this domain gap originates from a gap between the latent distributions of different
diffusion processes. To address this issue, we propose a novel inference
pipeline called Real-world Image Variation by ALignment (RIVAL) that utilizes
diffusion models to generate image variations from a single image exemplar. Our
pipeline enhances the generation quality of image variations by aligning the
image generation process to the source image's inversion chain. Specifically,
we demonstrate that step-wise latent distribution alignment is essential for
generating high-quality variations. To attain this, we design a cross-image
self-attention injection for feature interaction and a step-wise distribution
normalization to align the latent features. Incorporating these alignment
processes into a diffusion model allows RIVAL to generate high-quality image
variations without further parameter optimization. Our experimental results
demonstrate that our proposed approach outperforms existing methods with
respect to semantic-condition similarity and perceptual quality. Furthermore,
this generalized inference pipeline can be easily applied to other
diffusion-based generation tasks, such as image-conditioned text-to-image
generation and example-based image inpainting.
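
The two alignment components named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function names `align_latent_stats` and `cross_image_self_attention` are hypothetical, the normalization is taken to be a per-channel mean and standard deviation match between the generation latent and the source latent at the same step, and the attention injection is taken to mean that queries from the generated image attend over keys and values from both the source and the generated image.

```python
import numpy as np


def align_latent_stats(gen_latent, src_latent, eps=1e-6):
    """Match per-channel mean/std of gen_latent to src_latent.

    Both arrays have shape (C, H, W). This is an assumed form of
    step-wise distribution normalization, applied at one denoising step.
    """
    axes = (1, 2)
    gen_mean = gen_latent.mean(axis=axes, keepdims=True)
    gen_std = gen_latent.std(axis=axes, keepdims=True)
    src_mean = src_latent.mean(axis=axes, keepdims=True)
    src_std = src_latent.std(axis=axes, keepdims=True)
    # Whiten the generation latent, then re-color it with source statistics.
    return (gen_latent - gen_mean) / (gen_std + eps) * src_std + src_mean


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_image_self_attention(q_gen, k_gen, v_gen, k_src, v_src):
    """Attention where generated-image queries see both images.

    All inputs have shape (N, d). Keys and values from the source image
    (taken from its inversion chain) are concatenated with those of the
    generated image; this concatenation is an assumed design.
    """
    k = np.concatenate([k_src, k_gen], axis=0)  # (2N, d)
    v = np.concatenate([v_src, v_gen], axis=0)  # (2N, d)
    scores = q_gen @ k.T / np.sqrt(q_gen.shape[-1])  # (N, 2N)
    return softmax(scores, axis=-1) @ v  # (N, d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(2.0, 3.0, size=(4, 8, 8))  # stand-in for an inverted latent
    gen = rng.normal(0.0, 1.0, size=(4, 8, 8))  # stand-in for a generation latent
    aligned = align_latent_stats(gen, src)
    print(aligned.shape)
```

In an actual pipeline both operations would run inside the denoising loop of a pretrained diffusion model, once per step, with the source statistics and attention features read from the stored inversion chain. Because they only change inference-time activations, no model parameters are updated, which is consistent with the abstract's claim that no further parameter optimization is needed.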