Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts

July 14, 2023
Authors: Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Chen Zhang, Zhenhui Ye, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao
cs.AI

Abstract

Zero-shot text-to-speech aims to synthesize voices from unseen speech prompts. Previous large-scale multi-speaker TTS models have successfully achieved this goal with an enrolled recording of under 10 seconds. However, most of them are designed to utilize only short speech prompts, and the limited information in a short prompt significantly hinders fine-grained identity imitation. In this paper, we introduce Mega-TTS 2, a generic zero-shot multi-speaker TTS model capable of synthesizing speech for unseen speakers from prompts of arbitrary length. Specifically, we 1) design a multi-reference timbre encoder to extract timbre information from multiple reference speeches, and 2) train a prosody language model (P-LLM) with arbitrary-length speech prompts. With these designs, our model handles prompts of different lengths, which raises the upper bound of speech quality for zero-shot text-to-speech. Beyond arbitrary-length prompts, we introduce arbitrary-source prompts, which leverage the probabilities derived from multiple P-LLM outputs to produce expressive and controlled prosody. Furthermore, we propose a phoneme-level autoregressive duration model that brings in-context learning capabilities to duration modeling. Experiments demonstrate that our method can not only synthesize identity-preserving speech for an unseen speaker from a short prompt but also achieve improved performance with longer speech prompts. Audio samples can be found at https://mega-tts.github.io/mega2_demo/.
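
The multi-reference timbre encoder is the component that aggregates timbre information from however many reference utterances are available. Below is a minimal PyTorch sketch of one plausible realization, pooling over the concatenated reference frames with a learned attention query; the class name, dimensions, and pooling scheme are illustrative assumptions, not the authors' published architecture.

```python
# Hypothetical sketch of a multi-reference timbre encoder: each reference
# mel-spectrogram is embedded frame by frame, all frames are concatenated,
# and a single learned query attends over them to produce one fixed-size
# timbre vector regardless of how many prompts are given or how long they are.
import torch
import torch.nn as nn

class MultiRefTimbreEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        self.frame_proj = nn.Linear(n_mels, d_model)            # per-frame embedding
        self.query = nn.Parameter(torch.randn(1, 1, d_model))   # learned pooling query
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, refs: list[torch.Tensor]) -> torch.Tensor:
        # refs: list of (T_i, n_mels) mel-spectrograms from the same speaker.
        frames = torch.cat([self.frame_proj(r) for r in refs], dim=0)  # (sum T_i, d_model)
        frames = frames.unsqueeze(0)                                   # (1, sum T_i, d_model)
        timbre, _ = self.attn(self.query, frames, frames)              # attention pooling
        return timbre.squeeze(0).squeeze(0)                            # (d_model,)

# Usage: longer or additional prompts only lengthen the key/value sequence;
# the output stays a single fixed-size speaker embedding.
enc = MultiRefTimbreEncoder()
spk = enc([torch.randn(120, 80), torch.randn(300, 80)])
```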
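
Arbitrary-source prompting, as described above, combines probabilities from several P-LLM decoding runs. A hedged sketch of one way to realize this is shown below, mixing the per-source next-token distributions over discrete prosody codes before sampling; the fusion function and its weighting scheme are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical fusion of P-LLM outputs: each prompt source yields its own
# next-token logits over the prosody-code vocabulary; the distributions are
# mixed convexly, then one prosody token is sampled from the mixture.
import torch

def fuse_and_sample(logits_per_source: list[torch.Tensor],
                    weights: list[float]) -> int:
    # logits_per_source: one (vocab_size,) logits tensor per prompt source.
    # weights: mixing weights, e.g. uniform or tuned for prosody control.
    probs = torch.zeros_like(logits_per_source[0])
    for logits, w in zip(logits_per_source, weights):
        probs += w * torch.softmax(logits, dim=-1)  # convex mixture of distributions
    probs /= probs.sum()                            # renormalize for safety
    return int(torch.multinomial(probs, num_samples=1))

# Usage: weighting an expressive source more heavily biases the sampled
# prosody toward that source, one way to make prosody controllable.
token = fuse_and_sample([torch.randn(1024), torch.randn(1024)], [0.7, 0.3])
```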