DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation
December 21, 2023
Authors: Chenxu Zhang, Chao Wang, Jianfeng Zhang, Hongyi Xu, Guoxian Song, You Xie, Linjie Luo, Yapeng Tian, Xiaohu Guo, Jiashi Feng
cs.AI
Abstract
The generation of emotional talking faces from a single portrait image remains a significant challenge. Achieving expressive emotional talking and accurate lip-sync simultaneously is particularly difficult, as expressiveness is often compromised for the sake of lip-sync accuracy. LSTM networks, widely adopted by many prior works, often fail to capture the subtleties and variations of emotional expressions. To address these challenges, we introduce DREAM-Talk, a two-stage diffusion-based, audio-driven framework tailored for generating diverse expressions and accurate lip-sync concurrently. In the first stage, we propose EmoDiff, a novel diffusion module that generates diverse, highly dynamic emotional expressions and head poses in accordance with the audio and a reference emotion style. Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and the emotion style. Finally, we deploy a video-to-video rendering module to transfer the expressions and lip motions from our proxy 3D avatar to an arbitrary portrait. Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy, and perceptual quality.
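The abstract describes a two-stage design: a diffusion module (EmoDiff) that samples expression and head-pose sequences conditioned on audio and an emotion style, followed by a lip-focused refinement that sharpens lip-sync from the same conditions, with a separate video-to-video module handling rendering. The sketch below is a minimal, assumption-laden illustration in PyTorch of how such a pipeline could be wired together; it is not the authors' implementation. The module names, tensor dimensions, the GRU backbones, the residual lip refinement, and the simplified sampling loop are all hypothetical, the diffusion timestep embedding and the actual denoising update rule are omitted, and the rendering stage is not shown.

```python
# Illustrative sketch only: names, shapes, and update rules are assumptions,
# not the released DREAM-Talk code.
import torch
import torch.nn as nn


class EmoDiff(nn.Module):
    """Stage 1 (assumed interface): a conditional denoiser mapping a noisy
    expression/pose sequence plus audio and emotion-style conditions to a
    denoised motion sequence."""

    def __init__(self, motion_dim=70, audio_dim=128, style_dim=64, hidden=256):
        super().__init__()
        self.cond_proj = nn.Linear(audio_dim + style_dim, hidden)
        self.denoiser = nn.GRU(motion_dim + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, audio_feat, emotion_style, t):
        # Broadcast the per-clip emotion style over time and fuse it with audio.
        # A real denoiser would also embed the diffusion timestep t; omitted here.
        style = emotion_style.unsqueeze(1).expand(-1, audio_feat.size(1), -1)
        cond = self.cond_proj(torch.cat([audio_feat, style], dim=-1))
        h, _ = self.denoiser(torch.cat([noisy_motion, cond], dim=-1))
        return self.out(h)  # predicted clean motion (expression coeffs + head pose)


def sample_motion(model, audio_feat, emotion_style, motion_dim=70, steps=50):
    """Greatly simplified iterative loop standing in for the diffusion sampler."""
    batch, frames, _ = audio_feat.shape
    x = torch.randn(batch, frames, motion_dim)
    for step in reversed(range(steps)):
        t = torch.full((batch,), step)
        x = model(x, audio_feat, emotion_style, t)  # placeholder update rule
    return x


class LipRefiner(nn.Module):
    """Stage 2 (assumed): refines the coarse motion from audio and style so that
    lip-sync improves without flattening the expression dynamics."""

    def __init__(self, motion_dim=70, audio_dim=128, style_dim=64, hidden=256):
        super().__init__()
        self.net = nn.GRU(motion_dim + audio_dim + style_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, coarse_motion, audio_feat, emotion_style):
        style = emotion_style.unsqueeze(1).expand(-1, audio_feat.size(1), -1)
        h, _ = self.net(torch.cat([coarse_motion, audio_feat, style], dim=-1))
        return coarse_motion + self.out(h)  # residual correction, mostly lip region
```

Under these assumptions, usage would be: extract per-frame audio features and an emotion-style embedding, call sample_motion(EmoDiff(...), audio_feat, style) to obtain coarse expression and pose trajectories, pass them through LipRefiner for lip-sync-focused correction, and drive a proxy 3D avatar with the refined coefficients before the video-to-video renderer transfers the result to the target portrait.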