

MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models

October 20, 2025
Authors: Yongshun Zhang, Zhongyi Fan, Yonghang Zhang, Zhangzikang Li, Weifeng Chen, Zhongwei Feng, Chaoyue Wang, Peng Hou, Anxiang Zeng
cs.AI

Abstract

In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling; details are available at https://github.com/Shopee-MUG/MUG-V (our webpage).
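The "near-linear multi-node scaling" claim can be made concrete with the standard scaling-efficiency metric: the throughput of an N-node run divided by N times the single-node throughput. The sketch below is illustrative only; the throughput figures are hypothetical placeholders, not measured MUG-V numbers.

```python
def scaling_efficiency(throughput_by_nodes):
    """Efficiency of each N-node run vs. ideal linear scaling from 1 node.

    `throughput_by_nodes` maps node count -> total job throughput
    (e.g., samples/sec summed across the whole job). An efficiency of
    1.0 means perfectly linear scaling; values near 1.0 are "near-linear".
    """
    base = throughput_by_nodes[1]  # single-node baseline
    return {n: tput / (base * n) for n, tput in throughput_by_nodes.items()}

# Hypothetical throughputs (samples/sec) for 1, 2, 4, and 8 nodes.
measured = {1: 100.0, 2: 196.0, 4: 384.0, 8: 752.0}
for n, eff in scaling_efficiency(measured).items():
    print(f"{n} node(s): {eff:.0%} of ideal linear scaling")
```

With these placeholder numbers, the 8-node run reaches 752/800 = 94% of ideal throughput, which is the kind of figure usually described as near-linear.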