Waver: Wave Your Way to Lifelike Video Generation

Abstract

We present Waver, a high-performance foundation model for unified image and video generation. Waver directly generates videos of 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and train an MLLM-based video quality model on manually annotated data to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, Waver ranks in the top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community train high-quality video generation models more efficiently and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.
