Bridging Text and Video Generation: A Survey

Abstract

Text-to-video (T2V) generation technology has the potential to transform multiple domains, including education, marketing, entertainment, and assistive technologies for individuals with visual or reading comprehension challenges, by creating coherent visual content from natural language prompts. Since its inception, the field has advanced from adversarial models (GANs) to diffusion-based models, yielding higher-fidelity, temporally consistent outputs. Yet challenges persist, such as alignment, long-range coherence, and computational efficiency. Addressing this evolving landscape, we present a comprehensive survey of text-to-video generative models, tracing their development from early GANs and VAEs to hybrid Diffusion-Transformer (DiT) architectures. We detail how these models work, which limitations of their predecessors they addressed, and why shifts toward new architectural paradigms were necessary to overcome challenges in quality, coherence, and control. We provide a systematic account of the datasets on which the surveyed text-to-video models were trained and evaluated. To support reproducibility and assess the accessibility of training such models, we also detail their training configurations, including hardware specifications, GPU counts, batch sizes, learning rates, optimizers, epochs, and other key hyperparameters. Further, we outline the metrics commonly used to evaluate such models and present their performance across standard benchmarks, while also discussing the limitations of these metrics and the emerging shift toward more holistic, perception-aligned evaluation strategies. Finally, drawing on our analysis, we outline the current open challenges and propose a few promising future directions, laying out a perspective for future researchers to explore and build upon in advancing T2V research and applications.
