Wan: Open and Advanced Large-Scale Video Generative Models

Abstract

This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built on the mainstream diffusion transformer paradigm, Wan advances generative capability through a series of innovations: a novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. Together, these contributions improve the model's performance and versatility. Wan is characterized by four key features:

Leading Performance: The 14B model, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently and significantly outperforms existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks.

Comprehensiveness: Wan offers two models, with 1.3B and 14B parameters, targeting efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personalized video generation, encompassing up to eight tasks.

Consumer-Grade Efficiency: The 1.3B model is highly resource-efficient, requiring only 8.19 GB of VRAM, which makes it compatible with a wide range of consumer-grade GPUs.

Openness: We open-source the entire Wan series, including source code and all models, to foster the growth of the video generation community. This openness aims to expand the creative possibilities of video production in industry and to provide academia with high-quality video foundation models. All code and models are available at https://github.com/Wan-Video/Wan2.1.
