DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models

Abstract

Diffusion models have shown remarkable success in a variety of downstream generative tasks, yet they remain under-explored in expressive talking head generation, an important and challenging problem. In this work, we propose the DreamTalk framework to fill this gap, using a careful design to unlock the potential of diffusion models for generating expressive talking heads. Specifically, DreamTalk consists of three crucial components: a denoising network, a style-aware lip expert, and a style predictor. The diffusion-based denoising network consistently synthesizes high-quality audio-driven face motions across diverse expressions. To enhance the expressiveness and accuracy of lip motions, we introduce a style-aware lip expert that guides lip-sync while taking the speaking style into account. To eliminate the need for an expression reference video or text, an additional diffusion-based style predictor predicts the target expression directly from the audio. In this way, DreamTalk harnesses powerful diffusion models to generate expressive faces effectively and reduces the reliance on expensive style references. Experimental results demonstrate that DreamTalk generates photo-realistic talking faces with diverse speaking styles and accurate lip motions, surpassing existing state-of-the-art counterparts.
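As a rough illustration of what a diffusion-based denoising network does at inference time, the sketch below runs a generic DDPM-style reverse (ancestral sampling) loop over a sequence of face-motion parameters, conditioned on audio features and a style code. It is not the authors' implementation: `toy_denoiser`, `make_schedule`, `sample_motion`, the linear noise schedule, the noise-prediction parameterization, and all tensor shapes are assumptions chosen for illustration, and the style code here is a random vector standing in for whatever a style predictor or reference would supply.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_denoiser(x_t, t, audio_feat, style_code):
    """Hypothetical stand-in for a learned network that predicts the noise in x_t.

    A real model would be a trained neural network; this placeholder only
    shows the interface (noisy motion, step index, audio, style) -> noise.
    """
    return 0.1 * x_t + 0.01 * audio_feat.mean() + 0.01 * style_code.mean()


def sample_motion(denoiser, audio_feat, style_code, motion_dim, num_steps, rng):
    """Generic DDPM ancestral sampling: start from noise, denoise step by step."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    num_frames = audio_feat.shape[0]
    x = rng.standard_normal((num_frames, motion_dim))  # x_T ~ N(0, I)
    for t in reversed(range(num_steps)):
        eps = denoiser(x, t, audio_feat, style_code)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added at the final step (t == 0).
        noise = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
        x = mean + np.sqrt(betas[t]) * noise
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal((8, 16))  # 8 frames of 16-dim audio features (assumed)
    style = rng.standard_normal(4)        # 4-dim style code (assumed)
    motion = sample_motion(toy_denoiser, audio, style, motion_dim=6, num_steps=50, rng=rng)
    print(motion.shape)
```

In this sketch the conditioning signals are simply passed to the denoiser at every step; how the audio, style, and lip expert guidance are actually combined in DreamTalk is not specified in the abstract.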
