Probabilistic Adaptation of Text-to-Video Models
June 2, 2023
Authors: Mengjiao Yang, Yilun Du, Bo Dai, Dale Schuurmans, Joshua B. Tenenbaum, Pieter Abbeel
cs.AI
Abstract
Large text-to-video models trained on internet-scale data have demonstrated
exceptional capabilities in generating high-fidelity videos from arbitrary
textual descriptions. However, adapting these models to tasks with limited
domain-specific data, such as animation or robotics videos, poses a significant
computational challenge, since finetuning a pretrained large model can be
prohibitively expensive. Inspired by how a small modifiable component (e.g.,
prompts, prefix-tuning) can adapt a large language model to perform new tasks
without requiring access to the model weights, we investigate how to adapt a
large pretrained text-to-video model to a variety of downstream domains and
tasks without finetuning. In answering this question, we propose Video Adapter,
which leverages the score function of a large pretrained video diffusion model
as a probabilistic prior to guide the generation of a task-specific small video
model. Our experiments show that Video Adapter is capable of incorporating the
broad knowledge and preserving the high fidelity of a large pretrained video
model in a task-specific small video model that is able to generate
high-quality yet specialized videos on a variety of tasks such as animation,
egocentric modeling, and modeling of simulated and real-world robotics data.
More videos can be found on the website https://video-adapter.github.io/.
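The core idea of guiding a small task-specific model with a large pretrained model's score as a probabilistic prior can be illustrated with a toy sketch. All names here are hypothetical, and a simple annealed Langevin sampler on scalars stands in for full video diffusion sampling; the point is only the weighted composition of two score functions:

```python
import numpy as np

def combined_score(score_prior, score_small, x, t, prior_weight=0.5):
    # Weighted combination of the pretrained prior's score and the small
    # task-specific model's score (hypothetical interface, illustrating
    # the score-composition idea from the abstract).
    return prior_weight * score_prior(x, t) + (1.0 - prior_weight) * score_small(x, t)

def sample(score_prior, score_small, shape, n_steps=200, step_size=0.1,
           prior_weight=0.5, seed=0):
    # Langevin-style sampling driven by the combined score: a toy
    # stand-in for the reverse diffusion process over videos.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    for t in range(n_steps, 0, -1):
        grad = combined_score(score_prior, score_small, x, t, prior_weight)
        x = x + step_size * grad + np.sqrt(2 * step_size) * rng.standard_normal(shape)
    return x

# Toy stand-ins: the "prior" scores a Gaussian centered at 0, the "small
# model" one centered at 3; the combined sampler lands in between,
# depending on prior_weight.
prior = lambda x, t: -(x - 0.0)   # score of N(0, 1)
small = lambda x, t: -(x - 3.0)   # score of N(3, 1)
samples = sample(prior, small, shape=(10000,), prior_weight=0.5)
print(float(samples.mean()))
```

With equal weights the combined score is that of a Gaussian centered at 1.5, so the sample mean settles near 1.5; shifting `prior_weight` toward 1 pulls samples toward the prior, mirroring how the method trades off the broad pretrained prior against the specialized model.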