Probabilistic Adaptation of Text-to-Video Models

Abstract

Large text-to-video models trained on internet-scale data have demonstrated exceptional capabilities in generating high-fidelity videos from arbitrary textual descriptions. However, adapting these models to tasks with limited domain-specific data, such as animation or robotics videos, poses a significant computational challenge, since finetuning a pretrained large model can be prohibitively expensive. Inspired by how a small modifiable component (e.g., prompts, prefix-tuning) can adapt a large language model to perform new tasks without requiring access to the model weights, we investigate how to adapt a large pretrained text-to-video model to a variety of downstream domains and tasks without finetuning. In answering this question, we propose Video Adapter, which leverages the score function of a large pretrained video diffusion model as a probabilistic prior to guide the generation of a task-specific small video model. Our experiments show that Video Adapter is capable of incorporating the broad knowledge and preserving the high fidelity of a large pretrained video model in a task-specific small video model that is able to generate high-quality yet specialized videos on a variety of tasks such as animation, egocentric modeling, and modeling of simulated and real-world robotics data. More videos can be found on the website https://video-adapter.github.io/.
