Wan: Open and Advanced Large-Scale Video Generative Models
March 26, 2025
Authors: WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wang, Tianyi Gui, Tingyu Weng, Tong Shen, Wei Lin, Wei Wang, Wei Wang, Wenmeng Zhou, Wente Wang, Wenting Shen, Wenyuan Yu, Xianzhong Shi, Xiaoming Huang, Xin Xu, Yan Kou, Yangyu Lv, Yifei Li, Yijing Liu, Yiming Wang, Yingya Zhang, Yitong Huang, Yong Li, You Wu, Yu Liu, Yulin Pan, Yun Zheng, Yuntao Hong, Yupeng Shi, Yutong Feng, Zeyinzi Jiang, Zhen Han, Zhi-Fan Wu, Ziyu Liu
cs.AI
Abstract
This report presents Wan, a comprehensive and open suite of video foundation
models designed to push the boundaries of video generation. Built upon the
mainstream diffusion transformer paradigm, Wan achieves significant
advancements in generative capabilities through a series of innovations,
including our novel VAE, scalable pre-training strategies, large-scale data
curation, and automated evaluation metrics. These contributions collectively
enhance the model's performance and versatility. Specifically, Wan is
characterized by four key features: Leading Performance: The 14B model of Wan,
trained on a vast dataset comprising billions of images and videos,
demonstrates the scaling laws of video generation with respect to both data and
model size. It consistently outperforms the existing open-source models as well
as state-of-the-art commercial solutions across multiple internal and external
benchmarks, demonstrating a clear and significant performance superiority.
Comprehensiveness: Wan offers two capable models, with 1.3B and 14B
parameters, targeting efficiency and effectiveness respectively. It also covers
multiple downstream applications, including image-to-video, instruction-guided
video editing, and personal video generation, encompassing up to eight tasks.
Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource
efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range
of consumer-grade GPUs. Openness: We open-source the entire series of Wan,
including source code and all models, with the goal of fostering the growth of
the video generation community. This openness seeks to significantly expand the
creative possibilities of video production in the industry and provide academia
with high-quality video foundation models. All the code and models are
available at https://github.com/Wan-Video/Wan2.1.
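
To put the two model sizes in context, the following is a rough back-of-envelope sketch of the memory needed to hold the transformer weights alone at common numeric precisions. It is not the authors' measurement: the 8.19 GB figure above is their reported VRAM requirement for the 1.3B model, which presumably also covers activations and other components. The function name `weight_memory_gib` and the precision table are illustrative assumptions, not part of the Wan codebase.

```python
# Back-of-envelope estimate: memory for model weights only.
# Assumption: every parameter is stored at the same precision, and nothing
# else (activations, text encoder, VAE, framework overhead) is counted.

def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the size of the weights in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


# Parameter counts taken from the abstract; byte widths are standard.
MODEL_SIZES = {"1.3B": 1.3e9, "14B": 14e9}
PRECISIONS = {"fp32": 4, "fp16/bf16": 2}

if __name__ == "__main__":
    for model_name, num_params in MODEL_SIZES.items():
        for precision, width in PRECISIONS.items():
            gib = weight_memory_gib(num_params, width)
            print(f"{model_name} @ {precision}: {gib:.2f} GiB of weights")
```

Under these assumptions, the 1.3B weights occupy only a few GiB at half precision, which is consistent with the model fitting on consumer-grade GPUs, while the 14B weights alone are roughly an order of magnitude larger.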