Collaborative Score Distillation for Consistent Visual Synthesis

Abstract

Generative priors of large-scale text-to-image diffusion models enable a wide range of new generation and editing applications across diverse visual modalities. However, when these priors are adapted to complex visual modalities that are represented as multiple images (e.g., video), achieving consistency across the set of images is challenging. In this paper, we address this challenge with a novel method, Collaborative Score Distillation (CSD). CSD is based on Stein Variational Gradient Descent (SVGD). Specifically, we propose to treat multiple samples as "particles" in the SVGD update and to combine their score functions so that generative priors are distilled over a set of images synchronously. CSD thus facilitates seamless integration of information across 2D images, leading to consistent visual synthesis across multiple samples. We show the effectiveness of CSD on a variety of tasks, encompassing the visual editing of panorama images, videos, and 3D scenes. Our results underline the competency of CSD as a versatile method for enhancing inter-sample consistency, thereby broadening the applicability of text-to-image diffusion models.
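
To make the "particles" idea concrete, the following is a minimal sketch of a generic SVGD update in NumPy. It is not the CSD implementation: the abstract does not give the update rule, so the sketch assumes the standard SVGD direction with an RBF kernel, and it replaces the diffusion-model score with the score of a toy standard Gaussian. The names `rbf_kernel`, `svgd_direction`, `toy_score`, and `run_svgd`, as well as the bandwidth `h` and step size, are hypothetical choices for illustration.

```python
import numpy as np


def rbf_kernel(x, h):
    """RBF kernel between all particle pairs.

    Returns K[j, i] = exp(-||x_j - x_i||^2 / h) and
    grad_K[j, i] = gradient of K[j, i] with respect to x_j.
    """
    diff = x[:, None, :] - x[None, :, :]        # diff[j, i] = x_j - x_i, shape (n, n, d)
    sq_dist = np.sum(diff ** 2, axis=-1)        # shape (n, n)
    K = np.exp(-sq_dist / h)
    grad_K = (-2.0 / h) * diff * K[:, :, None]  # shape (n, n, d)
    return K, grad_K


def svgd_direction(x, score_fn, h=1.0):
    """Standard SVGD direction for each particle.

    phi_i = (1/n) * sum_j [ K[j, i] * score(x_j) + grad_{x_j} K[j, i] ]
    The first term shares score information across particles;
    the second term pushes particles apart.
    """
    n = x.shape[0]
    scores = score_fn(x)             # shape (n, d)
    K, grad_K = rbf_kernel(x, h)
    attract = K.T @ scores           # sum_j K[j, i] * score_j, shape (n, d)
    repulse = grad_K.sum(axis=0)     # sum over j, shape (n, d)
    return (attract + repulse) / n


def toy_score(x):
    """Assumed stand-in target: score of a standard Gaussian, grad log p(x) = -x."""
    return -x


def run_svgd(x0, score_fn, steps=500, step_size=0.1, h=1.0):
    """Apply the SVGD direction repeatedly to all particles at once."""
    x = x0.copy()
    for _ in range(steps):
        x = x + step_size * svgd_direction(x, score_fn, h)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    particles = rng.normal(loc=5.0, scale=1.0, size=(20, 2))
    updated = run_svgd(particles, toy_score)
    print("mean before:", particles.mean(axis=0))
    print("mean after: ", updated.mean(axis=0))
```

In this toy setting each row of `x` is one particle. In a setting like the one the abstract describes, each particle would presumably correspond to one image in the set and `score_fn` would be derived from the text-to-image diffusion model; both of those mappings are assumptions here, not details taken from the source.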
