DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation
December 21, 2023
Authors: Chenxu Zhang, Chao Wang, Jianfeng Zhang, Hongyi Xu, Guoxian Song, You Xie, Linjie Luo, Yapeng Tian, Xiaohu Guo, Jiashi Feng
cs.AI
Abstract
The generation of emotional talking faces from a single portrait image remains a significant challenge. Achieving expressive emotional talking and accurate lip-sync simultaneously is particularly difficult, as expressiveness is often compromised for the sake of lip-sync accuracy. LSTM networks, widely adopted by many prior works, often fail to capture the subtleties and variations of emotional expressions. To address these challenges, we introduce DREAM-Talk, a two-stage diffusion-based, audio-driven framework tailored for generating diverse expressions and accurate lip-sync concurrently. In the first stage, we propose EmoDiff, a novel diffusion module that generates diverse, highly dynamic emotional expressions and head poses in accordance with the audio and a reference emotion style. Given the strong correlation between lip motion and audio, we then refine the dynamics with enhanced lip-sync accuracy using audio features and the emotion style. Finally, we deploy a video-to-video rendering module to transfer the expressions and lip motions from our proxy 3D avatar to an arbitrary portrait. Both quantitatively and qualitatively, DREAM-Talk outperforms state-of-the-art methods in terms of expressiveness, lip-sync accuracy, and perceptual quality.
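The abstract describes a two-stage design: a diffusion module (EmoDiff) that samples expression and head-pose sequences conditioned on audio and an emotion style, followed by a lip-focused refinement that sharpens lip-sync from the same conditions, with a separate video-to-video module handling rendering. The sketch below is a minimal, assumption-laden illustration in PyTorch of how such a pipeline could be wired together; it is not the authors' implementation. The module names, tensor dimensions, the GRU backbones, the residual lip refinement, and the simplified sampling loop are all hypothetical, the diffusion timestep embedding and the actual denoising update rule are omitted, and the rendering stage is not shown.

```python
# Illustrative sketch only: names, shapes, and update rules are assumptions,
# not the released DREAM-Talk code.
import torch
import torch.nn as nn


class EmoDiff(nn.Module):
    """Stage 1 (assumed interface): a conditional denoiser mapping a noisy
    expression/pose sequence plus audio and emotion-style conditions to a
    denoised motion sequence."""

    def __init__(self, motion_dim=70, audio_dim=128, style_dim=64, hidden=256):
        super().__init__()
        self.cond_proj = nn.Linear(audio_dim + style_dim, hidden)
        self.denoiser = nn.GRU(motion_dim + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, noisy_motion, audio_feat, emotion_style, t):
        # Broadcast the per-clip emotion style over time and fuse it with audio.
        # A real denoiser would also embed the diffusion timestep t; omitted here.
        style = emotion_style.unsqueeze(1).expand(-1, audio_feat.size(1), -1)
        cond = self.cond_proj(torch.cat([audio_feat, style], dim=-1))
        h, _ = self.denoiser(torch.cat([noisy_motion, cond], dim=-1))
        return self.out(h)  # predicted clean motion (expression coeffs + head pose)


def sample_motion(model, audio_feat, emotion_style, motion_dim=70, steps=50):
    """Greatly simplified iterative loop standing in for the diffusion sampler."""
    batch, frames, _ = audio_feat.shape
    x = torch.randn(batch, frames, motion_dim)
    for step in reversed(range(steps)):
        t = torch.full((batch,), step)
        x = model(x, audio_feat, emotion_style, t)  # placeholder update rule
    return x


class LipRefiner(nn.Module):
    """Stage 2 (assumed): refines the coarse motion from audio and style so that
    lip-sync improves without flattening the expression dynamics."""

    def __init__(self, motion_dim=70, audio_dim=128, style_dim=64, hidden=256):
        super().__init__()
        self.net = nn.GRU(motion_dim + audio_dim + style_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, coarse_motion, audio_feat, emotion_style):
        style = emotion_style.unsqueeze(1).expand(-1, audio_feat.size(1), -1)
        h, _ = self.net(torch.cat([coarse_motion, audio_feat, style], dim=-1))
        return coarse_motion + self.out(h)  # residual correction, mostly lip region
```

Under these assumptions, usage would be: extract per-frame audio features and an emotion-style embedding, call sample_motion(EmoDiff(...), audio_feat, style) to obtain coarse expression and pose trajectories, pass them through LipRefiner for lip-sync-focused correction, and drive a proxy 3D avatar with the refined coefficients before the video-to-video renderer transfers the result to the target portrait.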