

Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance

January 28, 2024
作者: Qingcheng Zhao, Pengyu Long, Qixuan Zhang, Dafei Qin, Han Liang, Longwen Zhang, Yingliang Zhang, Jingyi Yu, Lan Xu
cs.AI

Abstract

The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and abundant, well-annotated multi-modality labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge through a trilogy. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. Then, we utilize GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This yields the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial animation dataset with well-annotated emotional and style labels. Finally, we propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation, accepting rich multi-modality guidance from audio, text, and images. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.
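To make the architecture described in the abstract more concrete, the sketch below shows one way such a latent-diffusion pipeline could be wired up: a denoiser operates on sequences of expression latents conditioned on per-frame audio features plus a text/image style embedding, and a VAE-style decoder (a stand-in for GNPFA) maps the sampled latents to facial geometry. This is a minimal, hypothetical sketch, not the paper's implementation; all module names, dimensions, and the DDPM-style sampler are assumptions.

```python
# Hypothetical sketch (not the authors' code): latent diffusion over expression
# latents with multi-modal conditioning, followed by decoding to geometry.
import torch
import torch.nn as nn

LATENT_DIM, COND_DIM, GEOM_DIM, STEPS = 64, 128, 512, 50  # illustrative sizes

class ExpressionDecoder(nn.Module):
    """Stand-in for the GNPFA decoder: expression latent -> per-frame geometry."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, GEOM_DIM))
    def forward(self, z):                       # z: (B, T, LATENT_DIM)
        return self.net(z)                      # (B, T, GEOM_DIM)

class LatentDenoiser(nn.Module):
    """Predicts the noise added to a latent sequence, given the diffusion step
    and a concatenated multi-modal condition (audio + text/image style)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM + COND_DIM + 1, 256),
                                 nn.SiLU(), nn.Linear(256, LATENT_DIM))
    def forward(self, z_t, t, cond):            # z_t, cond: (B, T, *), t: int
        t_feat = torch.full_like(z_t[..., :1], float(t) / STEPS)
        return self.net(torch.cat([z_t, cond, t_feat], dim=-1))

@torch.no_grad()
def sample_animation(denoiser, decoder, audio_feat, style_emb):
    """audio_feat: (B, T, COND_DIM-32) per-frame speech features;
       style_emb:  (B, 32) text/image style embedding broadcast over time."""
    B, T = audio_feat.shape[:2]
    cond = torch.cat([audio_feat, style_emb[:, None, :].expand(B, T, -1)], -1)
    betas = torch.linspace(1e-4, 0.02, STEPS)               # toy noise schedule
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)
    z = torch.randn(B, T, LATENT_DIM)                       # start from noise
    for t in reversed(range(STEPS)):                        # DDPM-style loop
        eps = denoiser(z, t, cond)
        z = (z - betas[t] / torch.sqrt(1.0 - alphas_cum[t]) * eps) \
            / torch.sqrt(1.0 - betas[t])
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return decoder(z)                                       # (B, T, GEOM_DIM)

geometry = sample_animation(LatentDenoiser(), ExpressionDecoder(),
                            torch.randn(1, 120, COND_DIM - 32),
                            torch.randn(1, 32))
print(geometry.shape)  # torch.Size([1, 120, 512])
```

The key design point the abstract emphasizes is that diffusion happens in the identity-decoupled expression latent space rather than directly on mesh vertices, which is why a decoder of this kind is needed as the final step.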