Training Video Foundation Models with NVIDIA NeMo
March 17, 2025
Authors: Zeeshan Patel, Ethan He, Parth Mannan, Xiaowei Ren, Ryan Wolf, Niket Agarwal, Jacob Huffman, Zhuoyao Wang, Carl Wang, Jack Chang, Yan Bai, Tommy Huang, Linnan Wang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page, Nima Tajbakhsh, Ashwath Aithal
cs.AI
Abstract
Video Foundation Models (VFMs) have recently been used to simulate the real
world in order to train physical AI systems and develop creative visual
experiences. However, significant challenges remain in training large-scale,
high-quality VFMs that can generate high-quality videos. We present a scalable,
open-source VFM training pipeline built on NVIDIA NeMo, providing accelerated
video dataset curation, multimodal data loading, and parallelized video
diffusion model training and inference. We also provide a comprehensive
performance analysis highlighting best practices for efficient VFM training
and inference.