CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Abstract

Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches, designed for a single subject, struggle to handle multiple subjects, which is a more challenging and practical scenario. In this work, we aim to promote multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects. Specifically, we first encourage the co-occurrence of multiple subjects by composing them in a single image. Then, on top of a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of the diffusion model. Moreover, to help the model focus on the specific object area, we segment the object from the given reference images and provide a corresponding object mask for attention learning. We also collect a multi-subject text-to-video generation dataset as a comprehensive benchmark, with 69 individual subjects and 57 meaningful pairs. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method over previous state-of-the-art approaches.
