Wan: Open and Advanced Large-Scale Video Generative Models
March 26, 2025
Authors: WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wang, Tianyi Gui, Tingyu Weng, Tong Shen, Wei Lin, Wei Wang, Wei Wang, Wenmeng Zhou, Wente Wang, Wenting Shen, Wenyuan Yu, Xianzhong Shi, Xiaoming Huang, Xin Xu, Yan Kou, Yangyu Lv, Yifei Li, Yijing Liu, Yiming Wang, Yingya Zhang, Yitong Huang, Yong Li, You Wu, Yu Liu, Yulin Pan, Yun Zheng, Yuntao Hong, Yupeng Shi, Yutong Feng, Zeyinzi Jiang, Zhen Han, Zhi-Fan Wu, Ziyu Liu
cs.AI
Abstract
This report presents Wan, a comprehensive and open suite of video foundation
models designed to push the boundaries of video generation. Built upon the
mainstream diffusion transformer paradigm, Wan achieves significant
advancements in generative capabilities through a series of innovations,
including our novel VAE, scalable pre-training strategies, large-scale data
curation, and automated evaluation metrics. These contributions collectively
enhance the model's performance and versatility. Specifically, Wan is
characterized by four key features: Leading Performance: The 14B model of Wan,
trained on a vast dataset comprising billions of images and videos,
demonstrates the scaling laws of video generation with respect to both data and
model size. It consistently outperforms the existing open-source models as well
as state-of-the-art commercial solutions across multiple internal and external
benchmarks, demonstrating a clear and significant performance superiority.
Comprehensiveness: Wan offers two capable models, with 1.3B and 14B
parameters, targeting efficiency and effectiveness respectively. It also covers
up to eight downstream tasks, including image-to-video, instruction-guided
video editing, and personalized video generation.
Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource
efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range
of consumer-grade GPUs. Openness: We open-source the entire series of Wan,
including source code and all models, with the goal of fostering the growth of
the video generation community. This openness seeks to significantly expand the
creative possibilities of video production in the industry and provide academia
with high-quality video foundation models. All the code and models are
available at https://github.com/Wan-Video/Wan2.1.