Collaborative Score Distillation for Consistent Visual Synthesis

Abstract

Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications on diverse visual modalities. However, when adapting these priors to complex visual modalities that are often represented as multiple images (e.g., video), achieving consistency across the set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on Stein Variational Gradient Descent (SVGD). Specifically, we propose to treat multiple samples as "particles" in the SVGD update and to combine their score functions so that generative priors are distilled over a set of images synchronously. CSD thus facilitates seamless integration of information across 2D images, leading to consistent visual synthesis across multiple samples. We show the effectiveness of CSD in a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
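To make the "particles" idea concrete, the following is a minimal sketch of a plain SVGD update, not the authors' CSD implementation. It assumes an RBF kernel with a median-heuristic bandwidth and a toy score function (the gradient of the log-density of a standard Gaussian) in place of a score derived from a diffusion model. The names `rbf_kernel`, `svgd_step`, and `gaussian_score` are hypothetical and chosen only for illustration.

```python
import numpy as np


def rbf_kernel(x, bandwidth=None):
    """Return the RBF kernel matrix and its gradients for particles x of shape (N, D)."""
    diff = x[:, None, :] - x[None, :, :]      # diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)      # squared pairwise distances, shape (N, N)
    if bandwidth is None:
        # Median heuristic (an assumption of this sketch, common in SVGD code).
        bandwidth = np.median(sq_dist) / np.log(x.shape[0] + 1.0) + 1e-8
    k = np.exp(-sq_dist / bandwidth)
    # grad_k[i, j] is the gradient of k(x_j, x_i) with respect to x_j.
    grad_k = (2.0 / bandwidth) * diff * k[:, :, None]
    return k, grad_k


def svgd_step(x, score_fn, step_size=0.1, bandwidth=None):
    """Apply one SVGD update to all particles jointly."""
    n = x.shape[0]
    scores = score_fn(x)                      # shape (N, D), one score per particle
    k, grad_k = rbf_kernel(x, bandwidth)
    # Kernel-weighted average of every particle's score (the "collaborative" part)
    # plus a repulsive term that keeps particles from collapsing onto one mode.
    phi = (k @ scores + grad_k.sum(axis=1)) / n
    return x + step_size * phi


def gaussian_score(x):
    """Toy score function: gradient of log N(0, I), a stand-in for a learned score."""
    return -x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    particles = 3.0 + 0.5 * rng.standard_normal((20, 2))
    for _ in range(2000):
        particles = svgd_step(particles, gaussian_score)
    print("particle mean:", particles.mean(axis=0))
    print("particle std:", particles.std(axis=0))
```

In this sketch each particle's update mixes in the scores of all other particles through the kernel, which is the mechanism the abstract refers to when it describes combining score functions across a set of images.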
