YingVideo-MV: Music-Driven Multi-Stage Video Generation

Abstract

While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motion remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to automatically synthesize high-quality music performance videos from audio signals. To support diverse, high-quality results, we construct a large-scale Music-in-the-Wild dataset from collected web data. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into the latent noise. To enhance continuity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on audio embeddings. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos and enables precise music-motion-camera synchronization. More videos are available on our project page: https://giantailab.github.io/YingVideo-MV/
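The abstract does not spell out how the time-aware dynamic window range strategy works, so the following is only an illustrative sketch of one plausible reading, not the authors' algorithm. It assumes per-frame audio embeddings are available as plain vectors, and it places each clip boundary at the frame where the audio embedding changes least, so that consecutive clips meet at an acoustically stable moment. The function names `l2_distance` and `plan_windows`, and the parameters `min_len`, `max_len`, and `overlap`, are hypothetical and chosen for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def plan_windows(audio_emb, min_len=4, max_len=8, overlap=2):
    """Split a long sequence into overlapping (start, end) frame ranges.

    Hypothetical heuristic (an assumption, not the paper's method):
    each window ends at the frame in [start + min_len, start + max_len]
    where the audio embedding changes least from the previous frame.
    `end` is exclusive; consecutive windows share `overlap` frames.
    The final window may be shorter than `min_len`.
    """
    if not 0 <= overlap < min_len <= max_len:
        raise ValueError("require 0 <= overlap < min_len <= max_len")
    n = len(audio_emb)
    if n == 0:
        return []
    windows = []
    start = 0
    while True:
        # The remaining frames fit in one window: emit it and stop.
        if n - start <= max_len:
            windows.append((start, n))
            break
        # Score each candidate cut by the audio change across the cut.
        candidates = range(start + min_len, start + max_len + 1)
        end = min(
            candidates,
            key=lambda e: l2_distance(audio_emb[e - 1], audio_emb[e]),
        )
        windows.append((start, end))
        # Step back by `overlap` frames so neighbouring clips share context.
        start = end - overlap
    return windows


# Toy 1-D embeddings: frames 5 and 6 are identical, so a cut there is cheapest.
toy_emb = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0],
           [5.0], [6.0], [7.0], [8.0], [9.0], [10.0]]
toy_windows = plan_windows(toy_emb)
```

In this toy case the first window ends at frame 6, the stable point, and the second window restarts two frames earlier to preserve overlap. A real system would presumably operate on learned audio features and couple the window choice to the diffusion denoising schedule, which this sketch does not attempt.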
