Real-World Image Variation by Aligning Diffusion Inversion Chain

Abstract

Recent advances in diffusion models have made it possible to generate high-fidelity images from text prompts. However, a domain gap exists between generated images and real-world images, which makes it difficult to generate high-quality variations of real-world images. Our investigation shows that this domain gap originates from a gap between the latent distributions of different diffusion processes. To address this issue, we propose a novel inference pipeline called Real-world Image Variation by ALignment (RIVAL), which uses diffusion models to generate image variations from a single image exemplar. Our pipeline improves the generation quality of image variations by aligning the image generation process with the source image's inversion chain. Specifically, we demonstrate that step-wise latent distribution alignment is essential for generating high-quality variations. To achieve this, we design a cross-image self-attention injection for feature interaction and a step-wise distribution normalization to align the latent features. Incorporating these alignment processes into a diffusion model allows RIVAL to generate high-quality image variations without further parameter optimization. Our experimental results demonstrate that the proposed approach outperforms existing methods in semantic-condition similarity and perceptual quality. Furthermore, this generalized inference pipeline can be easily applied to other diffusion-based generation tasks, such as image-conditioned text-to-image generation and example-based image inpainting.
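The two alignment components named in the abstract can be illustrated with a small NumPy sketch. The abstract does not give their exact formulation, so everything here is an assumption for illustration: the normalization is taken to be per-channel mean and standard-deviation matching between the generated latent and the source latent at the same denoising step, and the cross-image self-attention is taken to mean that queries from the generated image attend to keys and values from both images. The function names `align_latent_stats` and `cross_image_self_attention` are hypothetical and do not come from the authors' code.

```python
import numpy as np


def align_latent_stats(gen_latent, src_latent, eps=1e-5):
    """Match the per-channel mean/std of gen_latent to src_latent.

    Both arrays have shape (C, H, W). Assumption: this stands in for the
    "step-wise distribution normalization" and would be applied at each
    denoising step against the source latent from the inversion chain.
    """
    axes = (1, 2)  # spatial axes
    g_mean = gen_latent.mean(axis=axes, keepdims=True)
    g_std = gen_latent.std(axis=axes, keepdims=True)
    s_mean = src_latent.mean(axis=axes, keepdims=True)
    s_std = src_latent.std(axis=axes, keepdims=True)
    # Whiten with the generated statistics, then re-color with the source ones.
    return (gen_latent - g_mean) / (g_std + eps) * s_std + s_mean


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_image_self_attention(q_gen, k_gen, v_gen, k_src, v_src):
    """Attention where generated-image queries see both images' tokens.

    q_gen, k_gen, v_gen: (N, d) features of the generated image.
    k_src, v_src: (M, d) features of the source image.
    Assumption: source keys/values are simply concatenated in front of
    the generated ones.
    """
    k = np.concatenate([k_src, k_gen], axis=0)  # (M + N, d)
    v = np.concatenate([v_src, v_gen], axis=0)  # (M + N, d)
    scores = q_gen @ k.T / np.sqrt(q_gen.shape[-1])
    return softmax(scores, axis=-1) @ v  # (N, d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(2.0, 3.0, size=(4, 8, 8))  # stand-in for an inverted latent
    gen = rng.normal(0.0, 1.0, size=(4, 8, 8))  # stand-in for a generated latent
    aligned = align_latent_stats(gen, src)
    print(aligned.mean(axis=(1, 2)), src.mean(axis=(1, 2)))
```

In a full pipeline, both operations would presumably run inside the denoising loop of a pretrained diffusion model, once per step, which is consistent with the abstract's statement that no further parameter optimization is needed.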
