DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation
December 21, 2023
Authors: Chenxu Zhang, Chao Wang, Jianfeng Zhang, Hongyi Xu, Guoxian Song, You Xie, Linjie Luo, Yapeng Tian, Xiaohu Guo, Jiashi Feng
cs.AI
Abstract
The generation of emotional talking faces from a single portrait image
remains a significant challenge. The simultaneous achievement of expressive
emotional talking and accurate lip-sync is particularly difficult, as
expressiveness is often compromised for the accuracy of lip-sync. The LSTM
networks widely adopted by many prior works often fail to capture the
subtleties and variations of emotional expressions. To address these
challenges, we introduce DREAM-Talk, a two-stage diffusion-based audio-driven
framework, tailored for generating diverse expressions and accurate lip-sync
concurrently. In the first stage, we propose EmoDiff, a novel diffusion module
that generates diverse highly dynamic emotional expressions and head poses in
accordance with the audio and the referenced emotion style. Given the strong
correlation between lip motion and audio, we then refine the dynamics for
enhanced lip-sync accuracy using audio features and emotion style. Finally,
we deploy a video-to-video rendering module to transfer the expressions and lip
motions from our proxy 3D avatar to an arbitrary portrait. Both quantitatively
and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of
expressiveness, lip-sync accuracy and perceptual quality.
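
For readers who want a concrete picture of the two-stage design, the sketch below is a minimal, hypothetical PyTorch rendition of the pipeline the abstract describes: a diffusion denoiser (named EmoDiff after the paper's module) conditioned on audio features and an emotion-style embedding, followed by an audio-driven lip refinement and a video-to-video rendering step onto an arbitrary portrait. Beyond those high-level ideas, everything here is an assumption for illustration rather than the authors' implementation: the class interface, tensor shapes, the transformer denoiser, the naive denoising loop, and the lip_refiner and renderer callables.

import torch
import torch.nn as nn


class EmoDiff(nn.Module):
    """Stage 1 (illustrative): a diffusion denoiser that maps a noisy
    expression + head-pose sequence to a cleaner one, conditioned on audio
    features, an emotion-style embedding, and the diffusion timestep."""

    def __init__(self, motion_dim=70, audio_dim=512, style_dim=128, hidden=512):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim + audio_dim + style_dim + 1, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.denoiser = nn.TransformerEncoder(layer, num_layers=4)
        self.out_proj = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, t, audio_feat, style_emb):
        # noisy_motion: (B, T, motion_dim); audio_feat: (B, T, audio_dim)
        # style_emb: (B, style_dim), broadcast over time; t: (B,) diffusion step
        B, T, _ = noisy_motion.shape
        style = style_emb.unsqueeze(1).expand(B, T, style_emb.shape[-1])
        step = t.float().view(B, 1, 1).expand(B, T, 1)
        x = torch.cat([noisy_motion, audio_feat, style, step], dim=-1)
        return self.out_proj(self.denoiser(self.in_proj(x)))


def dream_talk_inference(audio_feat, style_emb, emodiff, lip_refiner, renderer,
                         portrait, motion_dim=70, steps=50):
    """Illustrative end-to-end flow: sample motion with EmoDiff (a naive
    iterative denoising loop, not the paper's exact sampler), refine the lip
    dynamics from audio (stage 2), then render onto the portrait via a
    video-to-video module. lip_refiner and renderer are hypothetical callables."""
    B, T, _ = audio_feat.shape
    motion = torch.randn(B, T, motion_dim)      # start from Gaussian noise
    for t in reversed(range(steps)):            # simplistic denoising loop
        t_batch = torch.full((B,), t)
        motion = emodiff(motion, t_batch, audio_feat, style_emb)
    motion = lip_refiner(motion, audio_feat, style_emb)  # stage 2: lip-sync refinement
    return renderer(portrait, motion)           # transfer to the arbitrary portrait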