Collaborative Score Distillation for Consistent Visual Synthesis

Abstract

The generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications across diverse visual modalities. However, when these priors are adapted to complex visual modalities that are often represented as multiple images (e.g., video), achieving consistency across the set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on Stein Variational Gradient Descent (SVGD). Specifically, we propose to treat multiple samples as "particles" in the SVGD update and to combine their score functions so that generative priors are distilled over a set of images synchronously. CSD thereby facilitates the seamless integration of information across 2D images, leading to consistent visual synthesis across multiple samples. We show the effectiveness of CSD on a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
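
To make the "particles" idea concrete, the following is a minimal sketch of a generic SVGD update in NumPy. It is not the CSD implementation: the toy Gaussian target, the RBF kernel with a fixed bandwidth `h`, the step size, and the names `rbf_kernel`, `svgd_step`, and `toy_score` are all assumptions made for illustration. In the setting described in the abstract, the score of each particle would come from a text-to-image diffusion model rather than from a closed-form Gaussian.

```python
import numpy as np


def rbf_kernel(x, h):
    """RBF kernel matrix and its gradients for particles x of shape (n, d)."""
    diff = x[:, None, :] - x[None, :, :]      # diff[i, j] = x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)      # (n, n) squared distances
    k = np.exp(-sq_dist / h)                  # k[i, j] = k(x_j, x_i), symmetric
    # grad_k[i, j] = gradient of k(x_j, x_i) with respect to x_j
    grad_k = (2.0 / h) * diff * k[:, :, None]
    return k, grad_k


def svgd_step(x, score_fn, step=0.1, h=1.0):
    """One SVGD update: each particle moves using the scores of all particles."""
    n = x.shape[0]
    k, grad_k = rbf_kernel(x, h)
    scores = score_fn(x)                      # (n, d), one score per particle
    # Kernel-weighted sum of scores (attraction) plus kernel gradients (repulsion).
    phi = (k @ scores + grad_k.sum(axis=1)) / n
    return x + step * phi


# Toy target (an assumption for this sketch): unit-variance Gaussian centred at MU.
MU = np.array([1.0, -1.0])


def toy_score(x):
    """Score (gradient of the log-density) of the toy Gaussian target."""
    return -(x - MU)


rng = np.random.default_rng(0)
particles = rng.normal(size=(20, 2)) + 5.0    # start far away from the target
for _ in range(1000):
    particles = svgd_step(particles, toy_score)

print("particle mean:", particles.mean(axis=0))
```

The point of the sketch is the coupling: the update of each particle mixes the scores of all particles through the kernel, which is the mechanism the abstract refers to when it speaks of combining score functions across a set of images.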
