Probabilistic Adaptation of Text-to-Video Models

Abstract

Large text-to-video models trained on internet-scale data have demonstrated exceptional capabilities in generating high-fidelity videos from arbitrary textual descriptions. However, adapting these models to tasks with limited domain-specific data, such as animation or robotics videos, poses a significant computational challenge, since finetuning a pretrained large model can be prohibitively expensive. Inspired by how a small modifiable component (e.g., prompts, prefix-tuning) can adapt a large language model to perform new tasks without requiring access to the model weights, we investigate how to adapt a large pretrained text-to-video model to a variety of downstream domains and tasks without finetuning. In answering this question, we propose Video Adapter, which leverages the score function of a large pretrained video diffusion model as a probabilistic prior to guide the generation of a task-specific small video model. Our experiments show that Video Adapter is capable of incorporating the broad knowledge and preserving the high fidelity of a large pretrained video model in a task-specific small video model that is able to generate high-quality yet specialized videos on a variety of tasks such as animation, egocentric modeling, and modeling of simulated and real-world robotics data. More videos can be found on the website https://video-adapter.github.io/.
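The idea of using a pretrained model's score function as a probabilistic prior can be illustrated with a toy sketch. It is not the authors' implementation: the two one-dimensional Gaussians below are hypothetical stand-ins for the large pretrained model and the small task-specific model, and the `prior_weight` parameter and all function names are assumptions made for illustration. The sketch relies only on the fact that the score of a product of densities is the sum of the individual scores, so sampling with the summed score draws from a distribution shaped by both models without touching either model's weights.

```python
import numpy as np


def gaussian_score(x, mean, std):
    """Score (gradient of the log-density) of N(mean, std^2) at x."""
    return -(x - mean) / std**2


def composed_score(x, prior_score, task_score, prior_weight=1.0):
    """Score of p_task(x) * p_prior(x)**prior_weight, up to normalization.

    prior_weight is a hypothetical knob, not a parameter from the source.
    """
    return task_score(x) + prior_weight * prior_score(x)


def langevin_sample(score_fn, n_samples=5000, n_steps=500, step=0.01, seed=0):
    """Draw approximate samples with unadjusted Langevin dynamics."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    for _ in range(n_steps):
        noise = rng.standard_normal(n_samples)
        x = x + step * score_fn(x) + np.sqrt(2.0 * step) * noise
    return x


# Toy stand-ins: a broad "pretrained prior" and a narrow "task-specific" model.
def prior_score(x):
    return gaussian_score(x, mean=0.0, std=2.0)


def task_score(x):
    return gaussian_score(x, mean=3.0, std=1.0)


def adapted_score(x):
    return composed_score(x, prior_score, task_score)


samples = langevin_sample(adapted_score)
print("sample mean:", samples.mean(), "sample variance:", samples.var())
```

For these two Gaussians the product density is itself Gaussian with mean 2.4 and variance 0.8, so the printed statistics should land close to those values. In a real video diffusion setting the two score functions would be the noise predictions of the pretrained and task-specific networks at each denoising step, and how they are weighted is a design choice this sketch does not settle.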
