Waver: Wave Your Way to Lifelike Video Generation

Abstract

We present Waver, a high-performance foundation model for unified image and video generation. Waver directly generates videos of 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and train an MLLM-based video quality model on manually annotated data to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community train high-quality video generation models more efficiently and accelerate progress in video generation technologies. Official page: https://github.com/FoundationVision/Waver.