
Wan-S2V: Audio-Driven Cinematic Video Generation

August 26, 2025
作者: Xin Gao, Li Hu, Siqi Hu, Mingyang Huang, Chaonan Ji, Dechao Meng, Jinwei Qi, Penchong Qiao, Zhen Shen, Yafei Song, Ke Sun, Linrui Tian, Guangyuan Wang, Qi Wang, Zhongjian Wang, Jiayu Xiao, Sheng Xu, Bang Zhang, Peng Zhang, Xindi Zhang, Zhe Zhang, Jingren Zhou, Lian Zhuo
cs.AI

Abstract

Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance in scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.