DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models
December 15, 2023
Authors: Yifeng Ma, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yingya Zhang, Zhidong Deng
cs.AI
Abstract
Diffusion models have shown remarkable success in a variety of downstream generative tasks, yet remain under-explored in the important and challenging task of expressive talking head generation. In this work, we propose the DreamTalk framework to fill this gap, employing a meticulous design to unlock the potential of diffusion models in generating expressive talking heads. Specifically, DreamTalk consists of three crucial components: a denoising network, a style-aware lip expert, and a style predictor. The diffusion-based denoising network consistently synthesizes high-quality, audio-driven face motions across diverse expressions. To enhance the expressiveness and accuracy of lip motions, we introduce a style-aware lip expert that guides lip-sync while remaining mindful of speaking styles. To eliminate the need for an expression reference video or text, an additional diffusion-based style predictor predicts the target expression directly from the audio. In this way, DreamTalk can harness powerful diffusion models to generate expressive faces effectively while reducing the reliance on expensive style references. Experimental results demonstrate that DreamTalk generates photo-realistic talking faces with diverse speaking styles and achieves accurate lip motions, surpassing existing state-of-the-art counterparts.