

MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models

October 20, 2025
Authors: Yongshun Zhang, Zhongyi Fan, Yonghang Zhang, Zhangzikang Li, Weifeng Chen, Zhongwei Feng, Chaoyue Wang, Peng Hou, Anxiang Zeng
cs.AI

Abstract

In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages: data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling; details are available on our webpage at https://github.com/Shopee-MUG/MUG-V.
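The abstract mentions curriculum-based pretraining, in which training typically progresses from short, low-resolution clips toward longer, higher-resolution ones. The paper does not state its exact schedule; the sketch below is a hypothetical illustration of such a resolution/duration ladder, with all stage names and numbers invented for the example.

```python
# Hypothetical curriculum schedule for video pretraining.
# Earlier stages use short, low-resolution clips; later stages increase
# resolution and clip length. All values here are illustrative only.

STAGES = [
    # (name, resolution, num_frames, fraction_of_total_steps)
    ("stage1", 256, 16, 0.5),
    ("stage2", 512, 32, 0.3),
    ("stage3", 720, 64, 0.2),
]

def stage_for_step(step: int, total_steps: int):
    """Return (name, resolution, num_frames) for a given training step."""
    progress = step / total_steps
    boundary = 0.0
    for name, res, frames, frac in STAGES:
        boundary += frac
        if progress < boundary:
            return name, res, frames
    # Final step falls through to the last stage.
    name, res, frames, _ = STAGES[-1]
    return name, res, frames
```

For example, with a 1000-step run, step 0 would fall in `stage1` (256 px, 16 frames) and step 900 in `stage3` (720 px, 64 frames); a data loader could query this schedule to pick the clip resolution and length for each batch.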