Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts
July 14, 2023
作者: Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Chen Zhang, Zhenhui Ye, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao
cs.AI
Abstract
Zero-shot text-to-speech aims at synthesizing voices with unseen speech
prompts. Previous large-scale multispeaker TTS models have successfully
achieved this goal with an enrolled recording of under 10 seconds. However, most
of them are designed to utilize only short speech prompts. The limited
information in short speech prompts significantly hinders the performance of
fine-grained identity imitation. In this paper, we introduce Mega-TTS 2, a
generic zero-shot multispeaker TTS model that is capable of synthesizing speech
for unseen speakers with arbitrary-length prompts. Specifically, we 1) design a
multi-reference timbre encoder to extract timbre information from multiple
reference speeches, and 2) train a prosody language model (P-LLM) with
arbitrary-length speech prompts. With these designs, our model is suitable for
prompts of different lengths, which extends the upper bound of speech quality
for zero-shot text-to-speech. Beyond arbitrary-length prompts, we introduce
arbitrary-source prompts, which leverage the probabilities derived from
multiple P-LLM outputs to produce expressive and controlled prosody.
Furthermore, we propose a phoneme-level auto-regressive duration model to
introduce in-context learning capabilities to duration modeling. Experiments
demonstrate that our method can not only synthesize identity-preserving
speech from a short prompt of an unseen speaker but also achieve improved
performance with longer speech prompts. Audio samples can be found at
https://mega-tts.github.io/mega2_demo/.
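
The abstract describes a multi-reference timbre encoder that folds several
reference utterances into one timbre representation, without implementation
detail. The following is a minimal PyTorch sketch of one plausible design,
assuming mel-spectrogram inputs and attention pooling with a learned query;
the class and parameter names are illustrative, not the authors' API.

```python
import torch
import torch.nn as nn

class MultiReferenceTimbreEncoder(nn.Module):
    """Hypothetical sketch: pool timbre information from a variable
    number of reference utterances into one global timbre embedding."""

    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        # Per-utterance encoder over mel-spectrogram frames.
        self.frame_proj = nn.Linear(n_mels, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        # A learned query attends over all frames of all references,
        # so any number of prompts can be folded into one vector.
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4,
                                          batch_first=True)

    def forward(self, refs: list[torch.Tensor]) -> torch.Tensor:
        # refs: list of (frames_i, n_mels) mel-spectrograms,
        # one tensor per reference utterance.
        encoded = []
        for mel in refs:
            h, _ = self.encoder(self.frame_proj(mel).unsqueeze(0))
            encoded.append(h)                 # (1, frames_i, d_model)
        memory = torch.cat(encoded, dim=1)    # concat along the time axis
        timbre, _ = self.attn(self.query, memory, memory)
        return timbre.squeeze(0).squeeze(0)   # (d_model,)
```

Because the references are concatenated along the time axis before pooling,
adding more (or longer) prompts simply enlarges the attention memory, which
matches the paper's claim of supporting arbitrary-length prompts.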
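The abstract also states that arbitrary-source prompts "leverage the
probabilities derived from multiple P-LLM outputs." One natural reading is a
weighted mixture of the next-token distributions produced by separate P-LLM
passes, each conditioned on a different prompt source. The sketch below shows
that fusion step under this assumption; `plm`, the prompt variables, and the
weights are placeholders, not the paper's actual interface.

```python
import torch

def fuse_plm_outputs(logits_per_source: list[torch.Tensor],
                     weights: list[float]) -> torch.Tensor:
    """Hypothetical sketch of arbitrary-source prompting: combine the
    next-token distributions from several P-LLM passes (each conditioned
    on a different prompt source) into one distribution for sampling."""
    probs = torch.zeros_like(logits_per_source[0])
    for logits, w in zip(logits_per_source, weights):
        probs += w * torch.softmax(logits, dim=-1)
    return probs / sum(weights)

# Usage (placeholders for the paper's P-LLM setup):
# logits_a = plm(prompt_from_source_a)
# logits_b = plm(prompt_from_source_b)
# fused = fuse_plm_outputs([logits_a, logits_b], weights=[0.7, 0.3])
# next_prosody_token = torch.multinomial(fused, num_samples=1)
```

Adjusting the per-source weights is what would make the generated prosody
controllable: shifting weight toward one source biases sampling toward that
source's prosodic style.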