Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance
January 28, 2024
作者: Qingcheng Zhao, Pengyu Long, Qixuan Zhang, Dafei Qin, Han Liang, Longwen Zhang, Yingliang Zhang, Jingyi Yu, Lan Xu
cs.AI
Abstract
The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and abundant, well-annotated multi-modality labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge through a trilogy. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. Then, we utilize GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This yields the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial animation dataset with well-annotated emotional and style labels. Finally, we propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation, accepting rich multi-modality guidance from audio, text, and image. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.
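To make the pipeline described in the abstract concrete, the sketch below pairs a toy expression VAE (standing in for GNPFA) with a latent denoiser that cross-attends to audio, text, and image embeddings and applies classifier-free guidance. This is a minimal illustrative reading of the abstract only: all module names, dimensions, network sizes, and the conditioning and guidance mechanisms are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a GNPFA-style expression VAE and a
# latent diffusion denoiser conditioned on audio/text/image embeddings.
# All names, dimensions, and the conditioning scheme are illustrative assumptions.
import torch
import torch.nn as nn

LATENT_DIM = 128   # assumed size of the generalized expression latent
COND_DIM = 512     # assumed size of each modality embedding (audio/text/image)
SEQ_LEN = 64       # assumed number of animation frames per window


class ExpressionVAE(nn.Module):
    """Toy stand-in for GNPFA: maps per-frame geometry features to an
    identity-agnostic expression latent and back."""
    def __init__(self, geom_dim=15069):   # e.g. 5023 vertices * 3 (assumed)
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(geom_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, 2 * LATENT_DIM))
        self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, 1024), nn.ReLU(),
                                     nn.Linear(1024, geom_dim))

    def encode(self, geom):
        mu, logvar = self.encoder(geom).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return z, mu, logvar

    def decode(self, z):
        return self.decoder(z)


class LatentDenoiser(nn.Module):
    """Toy denoiser in the expression latent space; multimodal guidance
    enters through cross-attention over condition tokens."""
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(LATENT_DIM, 256)
        self.time_mlp = nn.Sequential(nn.Linear(1, 256), nn.SiLU(), nn.Linear(256, 256))
        self.cond_proj = nn.Linear(COND_DIM, 256)
        self.attn = nn.MultiheadAttention(256, num_heads=4, batch_first=True)
        self.out_proj = nn.Linear(256, LATENT_DIM)

    def forward(self, z_noisy, t, cond_tokens):
        # z_noisy: (B, SEQ_LEN, LATENT_DIM), t: (B, 1), cond_tokens: (B, N, COND_DIM)
        h = self.in_proj(z_noisy) + self.time_mlp(t).unsqueeze(1)
        ctx = self.cond_proj(cond_tokens)
        h, _ = self.attn(h, ctx, ctx)   # cross-attend to audio/text/image tokens
        return self.out_proj(h)         # predicted noise


def cfg_denoise(model, z_noisy, t, cond_tokens, null_tokens, scale=2.5):
    """One classifier-free-guidance step: blend conditional and unconditional
    noise predictions (the guidance mechanism itself is an assumption)."""
    eps_cond = model(z_noisy, t, cond_tokens)
    eps_uncond = model(z_noisy, t, null_tokens)
    return eps_uncond + scale * (eps_cond - eps_uncond)


if __name__ == "__main__":
    B = 2
    vae, denoiser = ExpressionVAE(), LatentDenoiser()
    z = torch.randn(B, SEQ_LEN, LATENT_DIM)   # noisy expression latents
    t = torch.rand(B, 1)                      # diffusion timestep
    cond = torch.randn(B, 3, COND_DIM)        # audio / text / image embeddings
    null = torch.zeros_like(cond)             # "no guidance" tokens
    eps = cfg_denoise(denoiser, z, t, cond, null)
    geom = vae.decode(z)                      # latents decoded back to per-frame geometry
    print(eps.shape, geom.shape)              # torch.Size([2, 64, 128]) torch.Size([2, 64, 15069])
```

In this reading, the diffusion model never touches vertices directly: it denoises sequences in the expression latent space and relies on the VAE decoder to produce geometry, which is what allows one generative model to drive different identities.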