Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis
June 6, 2023
Authors: Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao
cs.AI
Abstract
We are interested in a novel task, namely low-resource text-to-talking avatar. Given only a few-minute-long talking-person video with its audio track as training data and arbitrary text as the driving input, we aim to synthesize high-quality talking portrait videos corresponding to the input text. This task has broad application prospects in the digital human industry but has not yet been technically achieved, due to two challenges: (1) it is challenging for a traditional multi-speaker Text-to-Speech (TTS) system to mimic the timbre of out-of-domain audio; and (2) it is hard to render high-fidelity, lip-synchronized talking avatars with limited training data. In this paper, we introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a generic zero-shot multi-speaker TTS model that well disentangles text content, timbre, and prosody; and (2) embraces recent advances in neural rendering to achieve realistic audio-driven talking-face video generation. With these designs, our method overcomes the two challenges above and generates identity-preserving speech and realistic talking-person videos. Experiments demonstrate that our method can synthesize realistic, identity-preserving, and audio-visually synchronized talking avatar videos.
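
Although the abstract describes the system only at a high level, the two-stage design maps naturally onto a simple pipeline: a zero-shot TTS stage that clones the target speaker's timbre from a short reference clip, followed by an audio-driven neural renderer trained on the few-minute target video. The Python sketch below illustrates that data flow only; every class, function, and parameter name is hypothetical (the paper publishes no API), and the model internals are replaced with placeholder outputs.

# Minimal, illustrative sketch of a two-stage text-to-talking-avatar pipeline
# in the spirit of Ada-TTA. All names here are hypothetical stand-ins; the
# actual models are not released with the paper.

import numpy as np

SAMPLE_RATE = 16000  # assumed audio sample rate for this sketch

class ZeroShotTTS:
    """Stand-in for the disentangled zero-shot multi-speaker TTS stage."""

    def synthesize(self, text: str, reference_audio: np.ndarray) -> np.ndarray:
        # Real model: encode the text content, extract timbre/prosody
        # embeddings from reference_audio, and decode a waveform.
        # Placeholder: return silence with length proportional to the text.
        duration_s = max(1, len(text) // 10)
        return np.zeros(SAMPLE_RATE * duration_s, dtype=np.float32)

class NeuralRenderer:
    """Stand-in for the audio-driven talking-face rendering stage."""

    def render(self, speech: np.ndarray, fps: int = 25) -> np.ndarray:
        # Real model: map audio features to lip-synchronized portrait frames.
        # Placeholder: return blank RGB frames matching the audio duration.
        num_frames = int(len(speech) / SAMPLE_RATE * fps)
        return np.zeros((num_frames, 256, 256, 3), dtype=np.uint8)

def text_to_talking_avatar(text: str, reference_audio: np.ndarray):
    """Chain the stages: text -> identity-preserving speech -> video frames."""
    speech = ZeroShotTTS().synthesize(text, reference_audio)
    frames = NeuralRenderer().render(speech)
    return speech, frames

if __name__ == "__main__":
    reference = np.zeros(SAMPLE_RATE * 5, dtype=np.float32)  # 5 s reference clip (placeholder)
    audio, video = text_to_talking_avatar("Hello from Ada-TTA!", reference)
    print(audio.shape, video.shape)  # e.g. (16000,) (25, 256, 256, 3)

The key design choice this sketch reflects is the decoupling of the two stages: the TTS stage only needs a short reference clip to preserve the speaker's timbre, while the renderer is adapted to the limited target video, so each stage can be trained and swapped independently.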