Movie Gen: A Cast of Media Foundation Models
October 17, 2024
Authors: Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sean Bell, Sharadh Ramaswamy, Shelly Sheynin, Siddharth Bhattacharya, Simran Motwani, Tao Xu, Tianhe Li, Tingbo Hou, Wei-Ning Hsu, Xi Yin, Xiaoliang Dai, Yaniv Taigman, Yaqiao Luo, Yen-Cheng Liu, Yi-Chiao Wu, Yue Zhao, Yuval Kirstain, Zecheng He, Zijian He, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu, Arun Mallya, Baishan Guo, Boris Araya, Breena Kerr, Carleigh Wood, Ce Liu, Cen Peng, Dimitry Vengertsev, Edgar Schonfeld, Elliot Blanchard, Felix Juefei-Xu, Fraylie Nord, Jeff Liang, John Hoffman, Jonas Kohler, Kaolin Fire, Karthik Sivakumar, Lawrence Chen, Licheng Yu, Luya Gao, Markos Georgopoulos, Rashel Moritz, Sara K. Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, Yuming Du
cs.AI
Abstract
We present Movie Gen, a cast of foundation models that generates
high-quality, 1080p HD videos with different aspect ratios and synchronized
audio. We also show additional capabilities such as precise instruction-based
video editing and generation of personalized videos based on a user's image.
Our models set a new state-of-the-art on multiple tasks: text-to-video
synthesis, video personalization, video editing, video-to-audio generation, and
text-to-audio generation. Our largest video generation model is a 30B parameter
transformer trained with a maximum context length of 73K video tokens,
corresponding to a generated video of 16 seconds at 16 frames-per-second. We
show multiple technical innovations and simplifications on the architecture,
latent spaces, training objectives and recipes, data curation, evaluation
protocols, parallelization techniques, and inference optimizations that allow
us to reap the benefits of scaling pre-training data, model size, and training
compute for training large scale media generation models. We hope this paper
helps the research community to accelerate progress and innovation in media
generation models. All videos from this paper are available at
https://go.fb.me/MovieGenResearchVideos.
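The abstract's numbers can be sanity-checked with simple arithmetic: a 16-second video at 16 frames per second is 256 frames, so a 73K-token context implies a few hundred video tokens per frame. A minimal sketch (the exact token count behind "73K" is an assumption; 73,000 is used here only as an approximation):

```python
# Back-of-the-envelope check of the abstract's figures.
seconds = 16
fps = 16
context_tokens = 73_000  # "73K video tokens" from the abstract; approximate

frames = seconds * fps                      # total frames in the generated clip
tokens_per_frame = context_tokens / frames  # average video tokens per frame

print(frames)                   # 256
print(round(tokens_per_frame))  # ~285 tokens per frame
```

This is only an order-of-magnitude illustration of the context length, not a statement of the model's actual tokenization scheme.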