Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance
January 28, 2024
作者: Qingcheng Zhao, Pengyu Long, Qixuan Zhang, Dafei Qin, Han Liang, Longwen Zhang, Yingliang Zhang, Jingyi Yu, Lan Xu
cs.AI
Abstract
The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and abundant, well-annotated multi-modality labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge through a trilogy. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. Then, we utilize GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This yields the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial animation dataset with well-annotated emotional and style labels. Finally, we propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation, accepting rich multi-modality guidance from audio, text, and image. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.
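To make the pipeline described in the abstract concrete, the sketch below pairs a toy expression VAE (standing in for GNPFA) with a latent denoiser that cross-attends to audio, text, and image embeddings and applies classifier-free guidance. This is a minimal illustrative reading of the abstract only: all module names, dimensions, network sizes, and the conditioning and guidance mechanisms are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a GNPFA-style expression VAE and a
# latent diffusion denoiser conditioned on audio/text/image embeddings.
# All names, dimensions, and the conditioning scheme are illustrative assumptions.
import torch
import torch.nn as nn

LATENT_DIM = 128   # assumed size of the generalized expression latent
COND_DIM = 512     # assumed size of each modality embedding (audio/text/image)
SEQ_LEN = 64       # assumed number of animation frames per window


class ExpressionVAE(nn.Module):
    """Toy stand-in for GNPFA: maps per-frame geometry features to an
    identity-agnostic expression latent and back."""
    def __init__(self, geom_dim=15069):   # e.g. 5023 vertices * 3 (assumed)
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(geom_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, 2 * LATENT_DIM))
        self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, 1024), nn.ReLU(),
                                     nn.Linear(1024, geom_dim))

    def encode(self, geom):
        mu, logvar = self.encoder(geom).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return z, mu, logvar

    def decode(self, z):
        return self.decoder(z)


class LatentDenoiser(nn.Module):
    """Toy denoiser in the expression latent space; multimodal guidance
    enters through cross-attention over condition tokens."""
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(LATENT_DIM, 256)
        self.time_mlp = nn.Sequential(nn.Linear(1, 256), nn.SiLU(), nn.Linear(256, 256))
        self.cond_proj = nn.Linear(COND_DIM, 256)
        self.attn = nn.MultiheadAttention(256, num_heads=4, batch_first=True)
        self.out_proj = nn.Linear(256, LATENT_DIM)

    def forward(self, z_noisy, t, cond_tokens):
        # z_noisy: (B, SEQ_LEN, LATENT_DIM), t: (B, 1), cond_tokens: (B, N, COND_DIM)
        h = self.in_proj(z_noisy) + self.time_mlp(t).unsqueeze(1)
        ctx = self.cond_proj(cond_tokens)
        h, _ = self.attn(h, ctx, ctx)   # cross-attend to audio/text/image tokens
        return self.out_proj(h)         # predicted noise


def cfg_denoise(model, z_noisy, t, cond_tokens, null_tokens, scale=2.5):
    """One classifier-free-guidance step: blend conditional and unconditional
    noise predictions (the guidance mechanism itself is an assumption)."""
    eps_cond = model(z_noisy, t, cond_tokens)
    eps_uncond = model(z_noisy, t, null_tokens)
    return eps_uncond + scale * (eps_cond - eps_uncond)


if __name__ == "__main__":
    B = 2
    vae, denoiser = ExpressionVAE(), LatentDenoiser()
    z = torch.randn(B, SEQ_LEN, LATENT_DIM)   # noisy expression latents
    t = torch.rand(B, 1)                      # diffusion timestep
    cond = torch.randn(B, 3, COND_DIM)        # audio / text / image embeddings
    null = torch.zeros_like(cond)             # "no guidance" tokens
    eps = cfg_denoise(denoiser, z, t, cond, null)
    geom = vae.decode(z)                      # latents decoded back to per-frame geometry
    print(eps.shape, geom.shape)              # torch.Size([2, 64, 128]) torch.Size([2, 64, 15069])
```

In this reading, the diffusion model never touches vertices directly: it denoises sequences in the expression latent space and relies on the VAE decoder to produce geometry, which is what allows one generative model to drive different identities.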