MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice
March 7, 2025
Authors: Hongwei Yi, Tian Ye, Shitong Shao, Xuancheng Yang, Jiantong Zhao, Hanzhong Guo, Terrance Wang, Qingyu Yin, Zeke Xie, Lei Zhu, Wei Li, Michael Lingelbach, Daquan Zhou
cs.AI
Abstract
We present MagicInfinite, a novel diffusion Transformer (DiT) framework that
overcomes traditional portrait animation limitations, delivering high-fidelity
results across diverse character types: realistic humans, full-body figures, and
stylized anime characters. It supports varied facial poses, including
back-facing views, and animates single or multiple characters with input masks
for precise speaker designation in multi-character scenes. Our approach tackles
key challenges with three innovations: (1) 3D full-attention mechanisms with a
sliding window denoising strategy, enabling infinite video generation with
temporal coherence and visual quality across diverse character styles; (2) a
two-stage curriculum learning scheme, integrating audio for lip sync, text for
expressive dynamics, and reference images for identity preservation, enabling
flexible multi-modal control over long sequences; and (3) region-specific masks
with adaptive loss functions to balance global textual control and local audio
guidance, supporting speaker-specific animations. Efficiency is enhanced via
our innovative unified step and CFG distillation techniques, achieving a 20x
inference speed boost over the base model: generating a 10-second 540x540p video
in 10 seconds or 720x720p in 30 seconds on 8 H100 GPUs, without quality loss.
Evaluations on our new benchmark demonstrate MagicInfinite's superiority in
audio-lip synchronization, identity preservation, and motion naturalness across
diverse scenarios. It is publicly available at https://www.hedra.com/, with
examples at https://magicinfinite.github.io/.
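
The abstract mentions a sliding window denoising strategy for generating arbitrarily long videos but gives no window size, overlap, or blending rule. The following is a minimal, generic sketch of how a long frame sequence might be covered by overlapping windows and cross-faded at the seams. The function names `plan_windows` and `blend_weights`, the linear cross-fade, and all numeric values are illustrative assumptions, not details taken from MagicInfinite.

```python
from typing import List, Tuple


def plan_windows(num_frames: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges covering num_frames with overlapping windows.

    Hypothetical helper: window and overlap are assumed parameters; the
    abstract does not state the values used by the authors.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if num_frames <= 0:
        return []
    stride = window - overlap  # how far each new window advances
    spans = []
    start = 0
    while True:
        end = min(start + window, num_frames)  # clip the last window
        spans.append((start, end))
        if end == num_frames:
            break
        start += stride
    return spans


def blend_weights(overlap: int) -> List[float]:
    """Linear cross-fade weights for the incoming window over the shared frames.

    Assumed blending rule for illustration; the outgoing window would use
    1 - w for each weight w returned here.
    """
    return [(i + 1) / (overlap + 1) for i in range(overlap)]


if __name__ == "__main__":
    # 10 frames, windows of 4 frames, 1 shared frame between neighbours.
    print(plan_windows(num_frames=10, window=4, overlap=1))
    print(blend_weights(3))
```

Because each window only needs the frames it shares with its predecessor, the number of frames processed at once stays fixed at the window length regardless of total video length, which is the general property that makes open-ended generation feasible in schemes of this kind.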