UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model
August 1, 2024
Authors: Xiangyu Fan, Jiaqi Li, Zhiqian Lin, Weiye Xiao, Lei Yang
cs.AI
Abstract
Audio-driven 3D facial animation aims to map input audio to realistic facial
motion. Despite significant progress, limitations arise from inconsistent 3D
annotations, restricting previous models to training on specific annotations
and thereby constraining the training scale. In this work, we present
UniTalker, a unified model featuring a multi-head architecture designed to
effectively leverage datasets with varied annotations. To enhance training
stability and ensure consistency among multi-head outputs, we employ three
training strategies, namely, PCA, model warm-up, and pivot identity embedding.
To expand the training scale and diversity, we assemble A2F-Bench, comprising
five publicly available datasets and three newly curated datasets. These
datasets cover a wide range of audio domains, including multilingual speech
and songs, scaling the training data from the commonly used datasets, which
typically contain less than 1 hour of audio, to 18.5 hours. With a single trained
UniTalker model, we achieve substantial lip vertex error reductions of 9.2% on
the BIWI dataset and 13.7% on Vocaset. Additionally, the pre-trained UniTalker
exhibits promise as a foundation model for audio-driven facial animation
tasks. Fine-tuning the pre-trained UniTalker on seen datasets further enhances
performance on each dataset, with an average error reduction of 6.3% on
A2F-Bench. Moreover, fine-tuning UniTalker on an unseen dataset with only half
the data surpasses prior state-of-the-art models trained on the full dataset.
The code and dataset are available at the project page
https://github.com/X-niper/UniTalker.
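The abstract's core design is a shared backbone feeding one output head per annotation convention, combined with PCA-reduced vertex targets and a pivot identity embedding. The PyTorch snippet below is only a minimal sketch of that multi-head idea; the module names, dimensions, GRU temporal model, and identity handling are illustrative assumptions, not UniTalker's actual architecture, which is available at the repository linked above.

```python
import torch
import torch.nn as nn


class MultiHeadA2F(nn.Module):
    """Toy multi-head audio-to-face regressor with per-annotation output heads."""

    def __init__(self, audio_dim=768, hidden_dim=512, head_dims=None, num_identities=16):
        super().__init__()
        # Map each annotation convention to its (PCA-reduced) output size.
        # These names and sizes are placeholders, not the paper's values.
        head_dims = head_dims or {"biwi": 128, "vocaset": 128}
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Shared identity table; a designated "pivot" identity can be fed through
        # every head so the per-dataset outputs stay consistent with one another.
        self.identity_emb = nn.Embedding(num_identities, hidden_dim)
        # One linear head per annotation convention, predicting PCA coefficients
        # that would later be projected back to per-dataset vertex offsets.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_dim, dim) for name, dim in head_dims.items()}
        )

    def forward(self, audio_feats, identity_id, head_name):
        # audio_feats: (batch, frames, audio_dim) features from an audio encoder.
        x = self.audio_proj(audio_feats)
        x = x + self.identity_emb(identity_id).unsqueeze(1)
        x, _ = self.temporal(x)
        return self.heads[head_name](x)  # (batch, frames, head_dims[head_name])


# Usage: route each batch to the head that matches its dataset's annotation.
model = MultiHeadA2F()
audio = torch.randn(2, 100, 768)        # dummy frame-level audio features
ids = torch.zeros(2, dtype=torch.long)  # pivot identity index 0
pca_coeffs = model(audio, ids, "vocaset")
print(pca_coeffs.shape)  # torch.Size([2, 100, 128])
```

Routing each training batch to the head matching its dataset lets one model consume datasets with incompatible vertex layouts, which is the property the abstract credits for scaling training to A2F-Bench's 18.5 hours of audio.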