DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

Abstract

Talking head generation aims to produce vivid and realistic talking head videos from a single portrait and a speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited use of context beyond the current generation step, error accumulation, and slow generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions and natural pose/blink movements. In addition to its high generation speed, DAWN possesses strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly available at https://github.com/Hanbo-Cheng/DAWN-pytorch.

