

Still-Moving: Customized Video Generation without Customized Video Data

July 11, 2024
作者: Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, Inbar Mosseri
cs.AI

Abstract

Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.
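
To make the recipe concrete, below is a minimal PyTorch-style sketch of the two adapter types and the "frozen video" construction described in the abstract. It is an illustrative interpretation only, not the authors' released implementation: the class names SpatialAdapter and MotionAdapter, the residual-bottleneck and scalar-gate designs, and the helper make_frozen_video are assumptions made for this sketch.

```python
# Conceptual sketch of the Still-Moving recipe (assumptions, not the paper's code).
import torch
import torch.nn as nn


class SpatialAdapter(nn.Module):
    """Lightweight residual adapter applied to features produced by an injected,
    customized T2I layer. Only adapter weights are trained."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping (zero residual)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))


class MotionAdapter(nn.Module):
    """Scales the output of a temporal layer so the model can be fit to static
    ("frozen") videos during adapter training; removed entirely at test time."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))

    def forward(self, temporal_out):
        return self.alpha * temporal_out


def make_frozen_video(image: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Repeat one customized-T2I image sample into a static video clip.
    image: (B, C, H, W) -> returns (B, T, C, H, W)."""
    return image.unsqueeze(1).repeat(1, num_frames, 1, 1, 1)
```

Under these assumptions, only the SpatialAdapter and MotionAdapter parameters would be optimized (e.g., with the standard diffusion denoising loss) on frozen videos built by make_frozen_video from samples of the customized T2I model. At inference the MotionAdapter modules are dropped so the T2V temporal layers run unmodified, restoring the motion prior, while the trained Spatial Adapters keep the injected T2I features aligned with the customization.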
