Training Video Foundation Models with NVIDIA NeMo

March 17, 2025
作者: Zeeshan Patel, Ethan He, Parth Mannan, Xiaowei Ren, Ryan Wolf, Niket Agarwal, Jacob Huffman, Zhuoyao Wang, Carl Wang, Jack Chang, Yan Bai, Tommy Huang, Linnan Wang, Sahil Jain, Shanmugam Ramasamy, Joseph Jennings, Ekaterina Sirazitdinova, Oleg Sudakov, Mingyuan Ma, Bobby Chen, Forrest Lin, Hao Wang, Vasanth Rao Naik Sabavat, Sriharsha Niverty, Rong Ou, Pallab Bhattacharya, David Page, Nima Tajbakhsh, Ashwath Aithal
cs.AI

Abstract

Video Foundation Models (VFMs) have recently been used to simulate the real world to train physical AI systems and develop creative visual experiences. However, there are significant challenges in training large-scale, high-quality VFMs that can generate high-quality videos. We present a scalable, open-source VFM training pipeline with NVIDIA NeMo, providing accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference. We also provide a comprehensive performance analysis highlighting best practices for efficient VFM training and inference.
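
To make the "parallelized video diffusion model training" claim concrete, the sketch below shows a generic data-parallel training loop for a video diffusion model in plain PyTorch. It is not the NVIDIA NeMo API; the model class, dataset, latent shapes, and noise schedule are hypothetical placeholders meant only to illustrate the kind of distributed training loop the abstract refers to.

```python
# Minimal, generic sketch of data-parallel video diffusion training.
# NOT the NeMo API: VideoDiffusionModel / VideoClipDataset and the
# latent layout (B, C, T, H, W) are hypothetical placeholders.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, epochs=1, lr=1e-4):
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    model = DDP(model.to(device), device_ids=[device.index])
    sampler = DistributedSampler(dataset)            # shard video clips across ranks
    loader = DataLoader(dataset, batch_size=2, sampler=sampler, num_workers=4)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)
        for latents, text_emb in loader:             # latents: (B, C, T, H, W)
            latents = latents.to(device)
            text_emb = text_emb.to(device)
            noise = torch.randn_like(latents)
            t = torch.rand(latents.shape[0], device=device)      # diffusion timesteps in [0, 1]
            w = t.view(-1, 1, 1, 1, 1)
            noisy = (1 - w) * latents + w * noise                 # simple interpolation-style noising
            pred = model(noisy, t, text_emb)                      # predict the added noise
            loss = torch.nn.functional.mse_loss(pred, noise)
            opt.zero_grad()
            loss.backward()
            opt.step()
    dist.destroy_process_group()
```

In practice, large-scale VFM training goes beyond this single-axis data parallelism; the point of the sketch is only to show where sharding the dataset and wrapping the model for distributed gradients fit into a diffusion training loop.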
