Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis
June 6, 2023
Authors: Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Chen Zhang, Xiang Yin, Zejun Ma, Zhou Zhao
cs.AI
Abstract
We are interested in a novel task, namely low-resource text-to-talking avatar. Given only a few-minute-long talking-person video with its audio track as training data and arbitrary text as the driving input, we aim to synthesize high-quality talking-portrait videos corresponding to the input text. This task has broad application prospects in the digital human industry but has not been technically achieved yet due to two challenges: (1) it is challenging for a traditional multi-speaker Text-to-Speech system to mimic the timbre of out-of-domain audio, and (2) it is hard to render high-fidelity, lip-synchronized talking avatars with limited training data. In this paper, we introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which (1) designs a generic zero-shot multi-speaker TTS model that effectively disentangles text content, timbre, and prosody, and (2) embraces recent advances in neural rendering to achieve realistic audio-driven talking-face video generation. With these designs, our method overcomes the two challenges above and generates identity-preserving speech and realistic talking-person videos. Experiments demonstrate that our method can synthesize realistic, identity-preserving, and audio-visually synchronized talking avatar videos.
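The abstract outlines a two-stage pipeline: a zero-shot multi-speaker TTS model first turns the input text into speech matching the target speaker's timbre, and a neural renderer then turns that speech into a lip-synchronized video. The sketch below illustrates one plausible reading of that decomposition; all class and method names are hypothetical placeholders rather than the authors' released code, and the zero-filled arrays stand in for real model outputs.

```python
# Minimal sketch of a two-stage text-to-talking-avatar pipeline as described
# in the abstract. Everything here is a hypothetical illustration, not the
# authors' actual API; a real system would use trained neural networks.
from dataclasses import dataclass

import numpy as np


@dataclass
class SpeechFactors:
    """The three factors the zero-shot TTS model is said to disentangle."""
    content: np.ndarray  # linguistic content derived from the input text
    timbre: np.ndarray   # speaker identity, extracted from reference audio
    prosody: np.ndarray  # rhythm and intonation, predicted or sampled


class ZeroShotTTS:
    """Hypothetical stand-in for the generic zero-shot multi-speaker TTS."""

    def extract_timbre(self, reference_audio: np.ndarray) -> np.ndarray:
        # In practice: a speaker encoder over the enrollment audio track.
        return np.zeros(256)

    def synthesize(self, text: str, timbre: np.ndarray) -> np.ndarray:
        # In practice: encode content, predict prosody, combine with the
        # timbre embedding in a decoder, then vocode to a waveform.
        factors = SpeechFactors(
            content=np.zeros((len(text), 256)),
            timbre=timbre,
            prosody=np.zeros((len(text), 4)),
        )
        _ = factors  # a real decoder would consume these factors
        return np.zeros(16000)  # placeholder 1 s waveform at 16 kHz


class NeuralRenderer:
    """Hypothetical stand-in for the audio-driven talking-face renderer."""

    def fit(self, training_frames: np.ndarray) -> None:
        # In practice: adapt the renderer to the few-minute training video.
        pass

    def render(self, audio: np.ndarray) -> np.ndarray:
        # In practice: map audio features to lip-synchronized video frames.
        num_frames = len(audio) * 25 // 16000  # assume 25 fps output
        return np.zeros((num_frames, 256, 256, 3))


def ada_tta(text: str, reference_audio: np.ndarray,
            training_frames: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Text in, identity-preserving speech and talking-avatar video out."""
    tts = ZeroShotTTS()
    renderer = NeuralRenderer()
    renderer.fit(training_frames)
    speech = tts.synthesize(text, tts.extract_timbre(reference_audio))
    video = renderer.render(speech)
    return speech, video
```

Decoupling the two stages is what makes the low-resource setting tractable under this reading: the TTS model is generic and zero-shot, so only the renderer needs to be adapted to the few minutes of person-specific video.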