Searching Priors Makes Text-to-Video Synthesis Better

Abstract

Significant advances in video diffusion models have brought substantial progress to text-to-video (T2V) synthesis. However, existing T2V synthesis models struggle to accurately generate complex motion dynamics, which reduces video realism. One possible solution is to collect massive amounts of data and train the model on it, but this would be extremely expensive. To alleviate this problem, we reformulate the typical T2V generation process as a search-based generation pipeline. Instead of scaling up model training, we employ existing videos as a motion prior database. Specifically, we divide the T2V generation process into two steps: (i) For a given prompt input, we search existing text-video datasets to find videos whose text labels closely match the motions in the prompt. For this, we propose a tailored search algorithm that emphasizes object motion features. (ii) The retrieved videos are processed and distilled into motion priors, which are used to fine-tune a pre-trained base T2V model; the desired video is then generated from the input prompt. By utilizing the priors gleaned from the retrieved videos, we enhance the realism of the generated videos' motion. All operations can be completed on a single NVIDIA RTX 4090 GPU. We validate our method against state-of-the-art T2V models across diverse prompt inputs. The code will be made public.
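To make step (i) concrete, the following is a minimal sketch of a motion-weighted caption search. It is not the authors' algorithm, which the abstract does not specify. It assumes a toy bag-of-words cosine similarity in which words from a hand-picked motion vocabulary count more than other words. The names `MOTION_TERMS`, `MOTION_WEIGHT`, `search_videos`, and the four-caption `DATASET` are all hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical motion vocabulary and weight (assumptions for this sketch only).
MOTION_TERMS = {"running", "jumping", "spinning", "swimming", "walking", "flying"}
MOTION_WEIGHT = 3.0


def weighted_bag(text):
    """Bag of words in which motion terms are up-weighted."""
    bag = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        bag[token] += MOTION_WEIGHT if token in MOTION_TERMS else 1.0
    return bag


def cosine(a, b):
    """Cosine similarity between two weighted bags."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_videos(prompt, dataset, top_k=2):
    """Return the ids of the top_k videos whose captions best match the prompt."""
    query = weighted_bag(prompt)
    ranked = sorted(
        dataset,
        key=lambda item: cosine(query, weighted_bag(item[1])),
        reverse=True,
    )
    return [video_id for video_id, _ in ranked[:top_k]]


# Toy stand-in for a text-video dataset: (video id, text label) pairs.
DATASET = [
    ("vid_001", "a dog running across a field"),
    ("vid_002", "a dog sleeping on a sofa"),
    ("vid_003", "a horse running on a beach"),
    ("vid_004", "a cat jumping over a fence"),
]

if __name__ == "__main__":
    print(search_videos("a dog running on the beach", DATASET, top_k=2))
```

With the motion weight applied, both "running" clips rank above the "sleeping" clip, even though the sleeping clip shares the object "dog" with the prompt. With all weights equal, `vid_001` and `vid_002` would tie. A practical system would more plausibly use a parser or learned text embeddings than a fixed word list.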
