Real-World Image Variation by Aligning Diffusion Inversion Chain

Abstract

Recent advances in diffusion models have made it possible to generate high-fidelity images from text prompts. However, a domain gap exists between generated images and real-world images, which makes it difficult to generate high-quality variations of real-world images. Our investigation finds that this domain gap originates from a gap in latent distributions between different diffusion processes. To address this issue, we propose a novel inference pipeline called Real-world Image Variation by ALignment (RIVAL), which uses diffusion models to generate image variations from a single image exemplar. Our pipeline improves the quality of image variations by aligning the image generation process with the source image's inversion chain. Specifically, we demonstrate that step-wise latent distribution alignment is essential for generating high-quality variations. To achieve this, we design a cross-image self-attention injection for feature interaction and a step-wise distribution normalization to align the latent features. Incorporating these alignment processes into a diffusion model allows RIVAL to generate high-quality image variations without further parameter optimization. Our experimental results demonstrate that the proposed approach outperforms existing methods in semantic-condition similarity and perceptual quality. Furthermore, this generalized inference pipeline can be easily applied to other diffusion-based generation tasks, such as image-conditioned text-to-image generation and example-based image inpainting.
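The two alignment components named in the abstract can be illustrated with a rough NumPy sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the function names align_latent_stats and cross_image_self_attention are hypothetical, the normalization is taken to be per-channel mean and standard deviation matching (AdaIN-style), and the attention injection is taken to be a concatenation of source keys and values with those of the generated image. The denoising step itself is omitted.

```python
import numpy as np


def align_latent_stats(gen_latent, src_latent, eps=1e-6):
    """Match per-channel mean/std of gen_latent to those of src_latent.

    Both arrays have shape (channels, height, width). Mean/std matching is
    an assumed reading of "step-wise distribution normalization".
    """
    axes = (1, 2)
    gen_mean = gen_latent.mean(axis=axes, keepdims=True)
    gen_std = gen_latent.std(axis=axes, keepdims=True)
    src_mean = src_latent.mean(axis=axes, keepdims=True)
    src_std = src_latent.std(axis=axes, keepdims=True)
    return (gen_latent - gen_mean) / (gen_std + eps) * src_std + src_mean


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_image_self_attention(q_gen, k_gen, v_gen, k_src, v_src):
    """Attention for the generated image with source keys/values injected.

    All inputs have shape (tokens, dim). Concatenating source and generated
    keys/values is an illustrative assumption, not a detail from the abstract.
    """
    keys = np.concatenate([k_src, k_gen], axis=0)
    values = np.concatenate([v_src, v_gen], axis=0)
    scores = q_gen @ keys.T / np.sqrt(q_gen.shape[-1])
    return softmax(scores, axis=-1) @ values


rng = np.random.default_rng(0)
# Stand-in for the source image's inversion chain: one latent per step.
src_chain = [rng.normal(0.3, 1.5, size=(4, 8, 8)) for _ in range(3)]
gen = rng.normal(0.0, 1.0, size=(4, 8, 8))
for src_t in src_chain:
    # A real pipeline would run one denoising step on `gen` here (omitted).
    gen = align_latent_stats(gen, src_t)
```

In this sketch the alignment is applied at every step of the chain rather than once at the start, which is the sense in which the abstract calls the alignment step-wise; no parameters are trained at any point.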

