
Bridging Text and Video Generation: A Survey

October 6, 2025
Authors: Nilay Kumar, Priyansh Bhandari, G. Maragatham
cs.AI

Abstract

Text-to-video (T2V) generation technology holds the potential to transform multiple domains, including education, marketing, entertainment, and assistive technologies for individuals with visual or reading comprehension challenges, by creating coherent visual content from natural language prompts. Since its inception, the field has advanced from adversarial models to diffusion-based models, yielding higher-fidelity, temporally consistent outputs. Yet challenges persist, such as alignment, long-range coherence, and computational efficiency. Addressing this evolving landscape, we present a comprehensive survey of text-to-video generative models, tracing their development from early GANs and VAEs to hybrid Diffusion-Transformer (DiT) architectures, detailing how these models work, which limitations of their predecessors they addressed, and why shifts toward new architectural paradigms were necessary to overcome challenges in quality, coherence, and control. We provide a systematic account of the datasets on which the surveyed text-to-video models were trained and evaluated, and, to support reproducibility and assess the accessibility of training such models, we detail their training configurations, including hardware specifications, GPU counts, batch sizes, learning rates, optimizers, epochs, and other key hyperparameters. Further, we outline the metrics commonly used to evaluate such models and present their performance across standard benchmarks, while also discussing the limitations of these metrics and the emerging shift toward more holistic, perception-aligned evaluation strategies. Finally, drawing on our analysis, we outline the current open challenges and propose several promising future directions, offering a perspective for future researchers to explore and build upon in advancing T2V research and applications.