Waver: Wave Your Way to Lifelike Video Generation

Abstract

We present Waver, a high-performance foundation model for unified image and video generation. Waver directly generates videos 5 to 10 seconds long at a native resolution of 720p, which are then upscaled to 1080p. The model supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline, and we manually annotate samples and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community train high-quality video generation models more efficiently and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.