Still-Moving: Customized Video Generation without Customized Video Data
July 11, 2024
Authors: Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, Inbar Mosseri
cs.AI
Abstract
Customizing text-to-image (T2I) models has seen tremendous progress recently,
particularly in areas such as personalization, stylization, and conditional
generation. However, expanding this progress to video generation is still in
its infancy, primarily due to the lack of customized video data. In this work,
we introduce Still-Moving, a novel generic framework for customizing a
text-to-video (T2V) model, without requiring any customized video data. The
framework applies to the prominent T2V design where the video model is built
over a text-to-image (T2I) model (e.g., via inflation). We assume access to a
customized version of the T2I model, trained only on still image data (e.g.,
using DreamBooth or StyleDrop). Naively plugging in the weights of the
customized T2I model into the T2V model often leads to significant artifacts or
insufficient adherence to the customization data. To overcome this issue, we
train lightweight Spatial Adapters that adjust the features produced
by the injected T2I layers. Importantly, our adapters are trained on
"frozen videos" (i.e., repeated images), constructed from image
samples generated by the customized T2I model. This training is facilitated by
a novel Motion Adapter module, which allows us to train on such
static videos while preserving the motion prior of the video model. At test
time, we remove the Motion Adapter modules and leave in only the trained
Spatial Adapters. This restores the motion prior of the T2V model while
adhering to the spatial prior of the customized T2I model. We demonstrate the
effectiveness of our approach on diverse tasks including personalized,
stylized, and conditional generation. In all evaluated scenarios, our method
seamlessly integrates the spatial prior of the customized T2I model with a
motion prior supplied by the T2V model.
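
To make the described mechanism concrete, below is a minimal sketch of the two adapter types and the "frozen video" construction, assuming a generic inflated T2V backbone. The module names (SpatialAdapter, MotionAdapter), the zero-initialized residual design, and the scalar gate are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SpatialAdapter(nn.Module):
    """Hypothetical lightweight adapter placed after an injected
    (customized) T2I spatial layer to adjust its features.
    Zero-initialized output so it starts as an identity mapping."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)  # residual branch starts at zero
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

class MotionAdapter(nn.Module):
    """Hypothetical gate on a temporal (motion) layer's residual branch.
    Driving the gate toward zero lets the model fit static 'frozen videos'
    without retraining the temporal layers; removing the module at test
    time restores the full motion prior."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))

    def forward(self, temporal_residual: torch.Tensor) -> torch.Tensor:
        return self.alpha * temporal_residual

def make_frozen_video(image: torch.Tensor, num_frames: int = 16) -> torch.Tensor:
    """A 'frozen video': one customized-T2I image sample repeated along
    the time axis. image: (C, H, W) -> video: (num_frames, C, H, W)."""
    return image.unsqueeze(0).repeat(num_frames, 1, 1, 1)
```

In this reading, only the Spatial Adapters (and the Motion Adapter gates) are trained on the frozen videos while the T2V weights stay fixed; at test time the Motion Adapter modules are dropped and only the Spatial Adapters remain, combining the customized T2I spatial prior with the original T2V motion prior.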