EMO2: End-Effector Guided Audio-Driven Avatar Video Generation
January 18, 2025
Authors: Linrui Tian, Siqi Hu, Qi Wang, Bang Zhang, Liefeng Bo
cs.AI
Abstract
In this paper, we propose a novel audio-driven talking head method capable of
simultaneously generating highly expressive facial expressions and hand
gestures. Unlike existing methods that focus on generating full-body or
half-body poses, we investigate the challenges of co-speech gesture generation
and identify the weak correspondence between audio features and full-body
gestures as a key limitation. To address this, we redefine the task as a
two-stage process. In the first stage, we generate hand poses directly from
audio input, leveraging the strong correlation between audio signals and hand
movements. In the second stage, we employ a diffusion model to synthesize video
frames, incorporating the hand poses generated in the first stage to produce
realistic facial expressions and body movements. Our experimental results
demonstrate that the proposed method outperforms state-of-the-art approaches,
such as CyberHost and Vlogger, in terms of both visual quality and
synchronization accuracy. This work provides a new perspective on audio-driven
gesture generation and a robust framework for creating expressive and natural
talking head animations.
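The abstract describes a two-stage pipeline: stage one maps audio features to hand poses, and stage two uses a diffusion model conditioned on those poses to synthesize video frames. The sketch below is only meant to make that data flow concrete; the module names, dimensions, and network choices (a GRU pose predictor and a toy denoiser) are illustrative assumptions, not the authors' released architecture.

```python
# Minimal sketch of a two-stage audio-to-gesture-to-video pipeline.
# All class names, shapes, and hyperparameters are illustrative assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn


class AudioToHandPose(nn.Module):
    """Stage 1 (sketch): map an audio feature sequence to per-frame hand keypoints."""

    def __init__(self, audio_dim=768, hidden_dim=512, num_hand_joints=42):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_hand_joints * 3)  # (x, y, z) per joint

    def forward(self, audio_feats):            # (B, T, audio_dim)
        h, _ = self.encoder(audio_feats)       # (B, T, hidden_dim)
        return self.head(h)                    # (B, T, num_hand_joints * 3)


class PoseConditionedDenoiser(nn.Module):
    """Stage 2 (sketch): one denoising step of a video diffusion model,
    conditioned on the stage-1 hand poses and the audio features."""

    def __init__(self, latent_dim=64, pose_dim=126, audio_dim=768):
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, latent_dim)
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.denoise = nn.Sequential(
            nn.Linear(latent_dim * 3, 256), nn.SiLU(), nn.Linear(256, latent_dim)
        )

    def forward(self, noisy_latents, hand_poses, audio_feats):
        # Fuse noisy video latents with pose and audio conditioning,
        # then predict the noise residual for this diffusion step.
        cond = torch.cat(
            [noisy_latents, self.pose_proj(hand_poses), self.audio_proj(audio_feats)],
            dim=-1,
        )
        return self.denoise(cond)


if __name__ == "__main__":
    B, T = 2, 30                                # batch of 2 clips, 30 frames each
    audio = torch.randn(B, T, 768)
    stage1 = AudioToHandPose()
    poses = stage1(audio)                       # (B, T, 126) hand keypoints
    stage2 = PoseConditionedDenoiser()
    latents = torch.randn(B, T, 64)             # placeholder noisy video latents
    noise_pred = stage2(latents, poses, audio)  # (B, T, 64)
    print(noise_pred.shape)
```

The split mirrors the abstract's motivation: the audio-to-hand mapping exploits the comparatively strong audio/hand correlation, while the diffusion stage is left to fill in face and body appearance given the pose conditioning.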