Bridging Text and Video Generation: A Survey
October 6, 2025
Authors: Nilay Kumar, Priyansh Bhandari, G. Maragatham
cs.AI
Abstract
Text-to-video (T2V) generation technology holds potential to transform
multiple domains such as education, marketing, entertainment, and assistive
technologies for individuals with visual or reading comprehension challenges,
by creating coherent visual content from natural language prompts. Since its
inception, the field has advanced from adversarial models to diffusion-based
models, yielding higher-fidelity, temporally consistent outputs. Yet challenges
persist, such as text-video alignment, long-range coherence, and computational
efficiency.
Addressing this evolving landscape, we present a comprehensive survey of
text-to-video generative models, tracing their development from early GANs and
VAEs to hybrid Diffusion-Transformer (DiT) architectures, detailing how these
models work, which limitations of their predecessors they addressed, and why
shifts toward new architectural paradigms were necessary to overcome challenges
in quality, coherence, and control. We provide a systematic account of the
datasets on which the surveyed text-to-video models were trained and evaluated,
and, to support reproducibility and assess the accessibility of training such
models, we detail their training configurations, including hardware
specifications, GPU counts, batch sizes, learning rates, optimizers, epochs,
and other key hyperparameters. Further, we outline the metrics commonly used
to evaluate such models and present their performance across
standard benchmarks, while also discussing the limitations of these metrics and
the emerging shift toward more holistic, perception-aligned evaluation
strategies. Finally, drawing from our analysis, we outline the current open
challenges and propose several promising future directions, offering a
perspective for researchers to explore and build upon in advancing T2V
research and applications.
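
As an illustration of the simpler text-video alignment metrics referenced above, the sketch below computes a frame-averaged CLIP similarity between a prompt and sampled video frames. This is a minimal sketch under assumed choices (the openai/clip-vit-base-patch32 checkpoint, uniform frame sampling, mean pooling); it is not the exact evaluation protocol of any surveyed model or benchmark.

```python
# Minimal sketch: frame-averaged CLIP text-video alignment score.
# Assumptions (not from the survey): CLIP checkpoint, frame sampling, mean pooling.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


def clip_text_video_score(frames: list[Image.Image], prompt: str,
                          model_name: str = "openai/clip-vit-base-patch32") -> float:
    """Average cosine similarity between a text prompt and a list of video frames."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)

    # Encode the prompt once and every sampled frame in a single batch.
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        frame_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

    # L2-normalize so the dot product equals cosine similarity.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    frame_emb = frame_emb / frame_emb.norm(dim=-1, keepdim=True)

    # One similarity per frame; the mean is a crude proxy for prompt alignment
    # and ignores temporal coherence, which is why richer metrics are needed.
    return (frame_emb @ text_emb.T).mean().item()
```

Because this score pools per-frame similarities, it captures prompt relevance but not motion quality or long-range consistency, which is one reason the survey discusses the shift toward more holistic, perception-aligned evaluation strategies.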