Probabilistic Adaptation of Text-to-Video Models
June 2, 2023
Authors: Mengjiao Yang, Yilun Du, Bo Dai, Dale Schuurmans, Joshua B. Tenenbaum, Pieter Abbeel
cs.AI
Abstract
Large text-to-video models trained on internet-scale data have demonstrated
exceptional capabilities in generating high-fidelity videos from arbitrary
textual descriptions. However, adapting these models to tasks with limited
domain-specific data, such as animation or robotics videos, poses a significant
computational challenge, since finetuning a pretrained large model can be
prohibitively expensive. Inspired by how a small modifiable component (e.g.,
prompts, prefix-tuning) can adapt a large language model to perform new tasks
without requiring access to the model weights, we investigate how to adapt a
large pretrained text-to-video model to a variety of downstream domains and
tasks without finetuning. In answering this question, we propose Video Adapter,
which leverages the score function of a large pretrained video diffusion model
as a probabilistic prior to guide the generation of a task-specific small video
model. Our experiments show that Video Adapter is capable of incorporating the
broad knowledge and preserving the high fidelity of a large pretrained video
model in a task-specific small video model that is able to generate
high-quality yet specialized videos on a variety of tasks such as animation,
egocentric modeling, and modeling of simulated and real-world robotics data.
More videos can be found on the website https://video-adapter.github.io/.
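The abstract's core mechanism is to treat the large pretrained model's score function as a probabilistic prior: the task-specific distribution is modeled as a product of the pretrained density and a small task model's density, so their scores add during sampling. The following is a minimal illustrative sketch of that score-composition idea, not the paper's actual networks or sampler: the two "models" are toy 1-D Gaussians, the weight `w` and the unadjusted Langevin sampler are assumptions chosen for clarity.

```python
# Illustrative sketch of score composition (Video Adapter's high-level idea):
# sample from a product distribution p_task(x) * p_prior(x)^w by adding the
# two models' scores. The Gaussian "models", the weight w, and the Langevin
# sampler are toy assumptions, not the paper's implementation.
import numpy as np

def score_gaussian(x, mu, sigma):
    """Score (gradient of the log-density) of a 1-D Gaussian N(mu, sigma^2)."""
    return (mu - x) / sigma**2

def combined_score(x, w=0.5):
    # Large "pretrained prior" centered at 0; small "task model" centered at 2.
    prior_score = score_gaussian(x, mu=0.0, sigma=1.0)
    task_score = score_gaussian(x, mu=2.0, sigma=1.0)
    # Product of densities <=> sum of scores; w trades off prior vs. task model.
    return task_score + w * prior_score

def langevin_sample(steps=2000, step_size=1e-2, seed=0):
    """Unadjusted Langevin dynamics driven by the combined score."""
    rng = np.random.default_rng(seed)
    x = rng.normal()
    for _ in range(steps):
        noise = rng.normal()
        x = x + step_size * combined_score(x) + np.sqrt(2 * step_size) * noise
    return x
```

With these toy Gaussians the combined score is `(2 - x) + 0.5 * (0 - x)`, so the sampler's stationary mean sits between the two modes at `2 / 1.5 ≈ 1.33`: the prior pulls the task model's samples toward what the large model considers likely, which is the guidance effect the abstract describes.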