StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation

Abstract

Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length, high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main obstacle preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party, off-the-shelf extractors to obtain audio embeddings, which are then injected directly into the diffusion model via cross-attention. Because current diffusion backbones lack any audio-related priors, this approach causes severe accumulation of latent distribution errors across video clips, so the latent distribution of subsequent segments gradually drifts away from the optimal distribution. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism that further enhances audio synchronization by leveraging the diffusion model's own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latents over time. Experiments on benchmarks demonstrate the effectiveness of StableAvatar both qualitatively and quantitatively.
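The abstract does not specify how the Dynamic Weighted Sliding-window Strategy computes its weights, so the following is only a minimal sketch of the general idea of fusing overlapping latent windows over time. It assumes a linear cross-fade across the overlapping frames; the function name `fuse_windows`, the `overlap` parameter, and the flat list-of-floats frame representation are all hypothetical and are not taken from the paper.

```python
from typing import List

# Stand-in for one latent frame, flattened to a list of floats (assumption).
Frame = List[float]


def fuse_windows(windows: List[List[Frame]], overlap: int) -> List[Frame]:
    """Join consecutive latent windows, blending the `overlap` shared frames.

    Assumption: the weight of the newer window rises linearly across the
    overlap. The paper's actual weighting scheme may differ.
    """
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    if not windows:
        return []

    # Start from a copy of the first window.
    fused = [list(frame) for frame in windows[0]]

    for window in windows[1:]:
        if overlap > len(fused) or overlap > len(window):
            raise ValueError("overlap exceeds window length")

        start = len(fused) - overlap
        for i in range(overlap):
            # Weight of the newer window: strictly between 0 and 1.
            w_new = (i + 1) / (overlap + 1)
            old_frame = fused[start + i]
            new_frame = window[i]
            fused[start + i] = [
                (1.0 - w_new) * a + w_new * b
                for a, b in zip(old_frame, new_frame)
            ]

        # Append the frames of the newer window that do not overlap.
        fused.extend(list(frame) for frame in window[overlap:])

    return fused
```

For example, fusing two four-frame windows with `overlap=2` yields six frames, with the two shared frames interpolated between the windows rather than switching abruptly at the boundary. That gradual transition is the smoothing effect a sliding-window fusion aims for.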
