Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance

Abstract

The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and of abundant, well-annotated multi-modality labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge in three parts. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder that maps facial geometry and images to a highly generalized expression latent space, decoupling expressions from identities. We then use GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This yields the M2F-D dataset, a large, diverse, scan-level co-speech 3D facial animation dataset with well-annotated emotion and style labels. Finally, we propose Media2Face, a diffusion model in the GNPFA latent space for co-speech facial animation generation that accepts rich multi-modality guidance from audio, text, and image. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.