

DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models

December 15, 2023
作者: Yifeng Ma, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yingya Zhang, Zhidong Deng
cs.AI

Abstract

Diffusion models have shown remarkable success in a variety of downstream generative tasks, yet remain under-explored in the important and challenging task of expressive talking head generation. In this work, we propose the DreamTalk framework to fill this gap, employing a meticulous design to unlock the potential of diffusion models in generating expressive talking heads. Specifically, DreamTalk consists of three crucial components: a denoising network, a style-aware lip expert, and a style predictor. The diffusion-based denoising network consistently synthesizes high-quality audio-driven face motions across diverse expressions. To enhance the expressiveness and accuracy of lip motions, we introduce a style-aware lip expert that guides lip-sync while being mindful of the speaking style. To eliminate the need for an expression reference video or text, an additional diffusion-based style predictor is used to predict the target expression directly from the audio. In this way, DreamTalk can harness powerful diffusion models to generate expressive faces effectively while reducing its reliance on expensive style references. Experimental results demonstrate that DreamTalk generates photo-realistic talking faces with diverse speaking styles and accurate lip motions, surpassing existing state-of-the-art counterparts.
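To make the three-component design concrete, below is a minimal, hypothetical sketch of how a style predictor, a diffusion denoising network, and a style-aware lip expert could fit together. All module names, feature dimensions, and the simplified noising step are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the DreamTalk-style three-component pipeline.
# Dimensions, architectures, and the toy diffusion step are assumptions for illustration.
import torch
import torch.nn as nn


class StylePredictor(nn.Module):
    """Predicts a speaking-style embedding directly from audio features (no reference video/text)."""
    def __init__(self, audio_dim=128, style_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, style_dim))

    def forward(self, audio_feat):
        # Pool over time, then map to a single style vector per clip.
        return self.net(audio_feat.mean(dim=1))


class DenoisingNetwork(nn.Module):
    """Predicts the noise added to face-motion coefficients, conditioned on audio, style, and timestep."""
    def __init__(self, motion_dim=64, audio_dim=128, style_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + style_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, motion_dim),
        )

    def forward(self, noisy_motion, audio_feat, style, t):
        B, T, _ = noisy_motion.shape
        t_emb = t.float().view(B, 1, 1).expand(B, T, 1)       # broadcast timestep over frames
        style_exp = style.unsqueeze(1).expand(-1, T, -1)       # broadcast style over frames
        x = torch.cat([noisy_motion, audio_feat, style_exp, t_emb], dim=-1)
        return self.net(x)


class StyleAwareLipExpert(nn.Module):
    """Scores audio/lip-motion synchronization while taking the speaking style into account."""
    def __init__(self, motion_dim=64, audio_dim=128, style_dim=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(motion_dim + audio_dim + style_dim, 256),
                                   nn.ReLU(), nn.Linear(256, 1))

    def forward(self, motion, audio_feat, style):
        style_exp = style.unsqueeze(1).expand(-1, motion.shape[1], -1)
        return self.score(torch.cat([motion, audio_feat, style_exp], dim=-1))


if __name__ == "__main__":
    B, T = 2, 25                                # batch size, number of frames
    audio = torch.randn(B, T, 128)              # per-frame audio features
    motion = torch.randn(B, T, 64)              # ground-truth face-motion coefficients

    style = StylePredictor()(audio)             # style predicted from audio alone
    t = torch.randint(0, 1000, (B,))            # diffusion timestep
    noisy = motion + torch.randn_like(motion)   # stand-in for the forward (noising) process
    pred_noise = DenoisingNetwork()(noisy, audio, style, t)
    sync_score = StyleAwareLipExpert()(motion, audio, style)
    print(pred_noise.shape, sync_score.shape)
```

In this reading, the lip expert would act as an auxiliary training signal that pushes the denoising network toward lip motions synchronized with the audio under the predicted style; how the sync score is turned into a loss is left open here.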