MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice

Abstract

We present MagicInfinite, a novel diffusion Transformer (DiT) framework that overcomes the limitations of traditional portrait animation and delivers high-fidelity results across diverse character types: realistic humans, full-body figures, and stylized anime characters. It supports varied facial poses, including back-facing views, and animates single or multiple characters, using input masks to designate the speaker precisely in multi-character scenes. Our approach tackles key challenges with three innovations: (1) 3D full-attention mechanisms with a sliding-window denoising strategy, enabling infinite video generation with temporal coherence and visual quality across diverse character styles; (2) a two-stage curriculum learning scheme that integrates audio for lip sync, text for expressive dynamics, and reference images for identity preservation, enabling flexible multi-modal control over long sequences; and (3) region-specific masks with adaptive loss functions that balance global textual control and local audio guidance, supporting speaker-specific animations. Efficiency is enhanced by our unified step and classifier-free guidance (CFG) distillation techniques, which achieve a 20x inference speedup over the base model: a 10-second 540x540p video is generated in 10 seconds, and a 720x720p video in 30 seconds, on 8 H100 GPUs without quality loss. Evaluations on our new benchmark demonstrate MagicInfinite's superiority in audio-lip synchronization, identity preservation, and motion naturalness across diverse scenarios. It is publicly available at https://www.hedra.com/, with examples at https://magicinfinite.github.io/.
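
To make the sliding-window idea in innovation (1) concrete, here is a minimal sketch of how a fixed-size denoiser might be applied to an arbitrarily long frame sequence. The abstract does not specify the actual scheme, so everything here is an assumption for illustration: the names `window_starts` and `sliding_window_step` are hypothetical, each frame is reduced to a single float, and overlapping windows are merged by simple averaging, which may differ from what MagicInfinite does.

```python
from typing import Callable, List


def window_starts(num_frames: int, window: int, stride: int) -> List[int]:
    """Start indices of overlapping windows that together cover every frame."""
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    if num_frames <= window:
        return [0]
    starts = list(range(0, num_frames - window, stride))
    starts.append(num_frames - window)  # last window sits flush with the end
    return starts


def sliding_window_step(
    frames: List[float],
    denoise_window: Callable[[List[float]], List[float]],
    window: int,
    stride: int,
) -> List[float]:
    """Run one denoising step over all frames, one window at a time.

    Frames covered by several windows receive the average of the
    per-window outputs (an assumed blending rule, not the paper's).
    """
    total = [0.0] * len(frames)
    count = [0] * len(frames)
    for start in window_starts(len(frames), window, stride):
        out = denoise_window(frames[start:start + window])
        for offset, value in enumerate(out):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]


def toy_denoiser(chunk: List[float]) -> List[float]:
    """Stand-in for the model: simply halves every value in the window."""
    return [0.5 * x for x in chunk]


starts = window_starts(num_frames=10, window=4, stride=2)
result = sliding_window_step([1.0] * 10, toy_denoiser, window=4, stride=2)
```

Because the model only ever sees `window` frames at once, its cost per call stays fixed while the sequence length is unbounded; the overlap between neighbouring windows is what gives adjacent segments a chance to stay temporally consistent.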