

Wan-S2V: Audio-Driven Cinematic Video Generation

August 26, 2025
作者: Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, Ke Sun, Linrui Tian, Guangyuan Wang, Qi Wang, Zhongjian Wang, Jiayu Xiao, Sheng Xu, Bang Zhang, Peng Zhang, Xindi Zhang, Zhe Zhang, Jingren Zhou, Lian Zhuo
cs.AI

Abstract

Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance for scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.
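
The abstract does not describe the model's internals, but audio-driven generation built on a video diffusion backbone such as Wan is commonly implemented by injecting audio features into the denoising network, e.g. via cross-attention. Below is a minimal, self-contained PyTorch sketch of that general idea; the `ToyAudioCrossAttention` module, the tensor shapes, and the four-step sampling loop are illustrative assumptions for exposition, not the actual Wan-S2V architecture.

```python
# Conceptual sketch only: a toy audio-conditioned denoising loop,
# NOT the Wan-S2V method (the abstract does not specify the mechanism).
import torch
import torch.nn as nn

class ToyAudioCrossAttention(nn.Module):
    """Video latents attend to audio features (hypothetical conditioning path)."""
    def __init__(self, dim: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, video_latents: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # video_latents: (batch, frames*patches, dim); audio_feats: (batch, audio_tokens, dim)
        out, _ = self.attn(video_latents, audio_feats, audio_feats)
        return video_latents + out  # residual audio conditioning

torch.manual_seed(0)
cond = ToyAudioCrossAttention()
latents = torch.randn(1, 16 * 8, 64)   # 16 frames x 8 patches per frame, assumed shape
audio = torch.randn(1, 32, 64)         # 32 audio tokens, assumed shape
for step in range(4):                  # a real sampler runs many more steps
    noise_estimate = cond(latents, audio)  # a real model would predict noise with a large DiT
    latents = latents - 0.1 * noise_estimate
print(latents.shape)  # torch.Size([1, 128, 64])
```

In a full system, the denoised latents would be decoded by a video VAE into frames, and lip-sync editing would restrict this conditioning to mouth-region latents; both are standard patterns for this model family rather than details confirmed by the abstract.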