Interpolating between Images with Diffusion Models

Abstract

One little-explored frontier of image generation and editing is the task of interpolating between two input images, a feature missing from all currently deployed image generation pipelines. We argue that this feature can expand the creative applications of these models, and we propose a method for zero-shot interpolation using latent diffusion models. We apply interpolation in the latent space at a sequence of decreasing noise levels, then perform denoising conditioned on interpolated text embeddings derived from textual inversion and (optionally) subject poses. For greater consistency, or to specify additional criteria, we can generate several candidates and use CLIP to select the highest-quality image. We obtain convincing interpolations across diverse subject poses, image styles, and image content, and show that standard quantitative metrics such as FID are insufficient to measure the quality of an interpolation. Code and data are available at https://clintonjwang.github.io/interpolation.
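
As a rough illustration of the latent-space step described above, the following sketch noises two latents to a shared noise level, interpolates between them, and interpolates the two text embeddings. It is not the released implementation: the use of spherical interpolation for latents and linear interpolation for embeddings is an assumption, and the names `slerp`, `lerp`, `add_noise`, `interpolation_inputs`, and `alpha_bar` are hypothetical. The denoiser, textual inversion, pose conditioning, and CLIP ranking are omitted.

```python
import numpy as np


def slerp(a, b, t):
    """Spherical interpolation between two arrays of the same shape."""
    a_flat, b_flat = a.ravel(), b.ravel()
    cos_omega = np.dot(a_flat, b_flat) / (
        np.linalg.norm(a_flat) * np.linalg.norm(b_flat)
    )
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if np.isclose(np.sin(omega), 0.0):
        # Vectors are (anti)parallel: fall back to linear interpolation.
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)


def lerp(a, b, t):
    """Linear interpolation, used here for text embeddings (an assumption)."""
    return (1.0 - t) * a + t * b


def add_noise(latent, alpha_bar, rng):
    """Forward-diffuse a latent; alpha_bar in (0, 1] is the retained signal fraction."""
    noise = rng.standard_normal(latent.shape)
    return np.sqrt(alpha_bar) * latent + np.sqrt(1.0 - alpha_bar) * noise


def interpolation_inputs(latent_a, latent_b, emb_a, emb_b, t, alpha_bar, rng):
    """Return the noisy interpolated latent and interpolated text embedding.

    A denoiser (not shown) would take these as its starting point and condition.
    """
    noisy_a = add_noise(latent_a, alpha_bar, rng)
    noisy_b = add_noise(latent_b, alpha_bar, rng)
    return slerp(noisy_a, noisy_b, t), lerp(emb_a, emb_b, t)
```

Calling `interpolation_inputs` with `t = 0.5` and progressively larger `alpha_bar` values (that is, progressively less noise) would correspond to interpolating at a sequence of decreasing noise levels; the resulting pair would then be passed to a latent diffusion denoiser.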
