Waver: Wave Your Way to Lifelike Video Generation

Abstract

We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.