Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts
July 14, 2023
作者: Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Chen Zhang, Zhenhui Ye, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma, Zhou Zhao
cs.AI
Abstract
Zero-shot text-to-speech aims at synthesizing voices with unseen speech
prompts. Previous large-scale multispeaker TTS models have successfully
achieved this goal with an enrolled recording of under 10 seconds. However, most
of them are designed to utilize only short speech prompts. The limited
information in short speech prompts significantly hinders the performance of
fine-grained identity imitation. In this paper, we introduce Mega-TTS 2, a
generic zero-shot multispeaker TTS model that is capable of synthesizing speech
for unseen speakers with arbitrary-length prompts. Specifically, we 1) design a
multi-reference timbre encoder to extract timbre information from multiple
reference speeches, and 2) train a prosody language model (P-LLM) with
arbitrary-length speech prompts. With these designs, our model is suitable for
prompts of different lengths, which extends the upper bound of speech quality
for zero-shot text-to-speech. Beyond arbitrary-length prompts, we introduce
arbitrary-source prompts, which leverage the probabilities derived from
multiple P-LLM outputs to produce expressive and controlled prosody.
Furthermore, we propose a phoneme-level auto-regressive duration model to
introduce in-context learning capabilities to duration modeling. Experiments
demonstrate that our method can not only synthesize identity-preserving
speech from a short prompt of an unseen speaker but also achieve improved
performance with longer speech prompts. Audio samples can be found at
https://mega-tts.github.io/mega2_demo/.
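
The abstract describes a multi-reference timbre encoder that folds several
reference utterances into one timbre representation, without implementation
detail. The following is a minimal PyTorch sketch of one plausible design,
assuming mel-spectrogram inputs and attention pooling with a learned query;
the class and parameter names are illustrative, not the authors' API.

```python
import torch
import torch.nn as nn

class MultiReferenceTimbreEncoder(nn.Module):
    """Hypothetical sketch: pool timbre information from a variable
    number of reference utterances into one global timbre embedding."""

    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        # Per-utterance encoder over mel-spectrogram frames.
        self.frame_proj = nn.Linear(n_mels, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        # A learned query attends over all frames of all references,
        # so any number of prompts can be folded into one vector.
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4,
                                          batch_first=True)

    def forward(self, refs: list[torch.Tensor]) -> torch.Tensor:
        # refs: list of (frames_i, n_mels) mel-spectrograms,
        # one tensor per reference utterance.
        encoded = []
        for mel in refs:
            h, _ = self.encoder(self.frame_proj(mel).unsqueeze(0))
            encoded.append(h)                 # (1, frames_i, d_model)
        memory = torch.cat(encoded, dim=1)    # concat along the time axis
        timbre, _ = self.attn(self.query, memory, memory)
        return timbre.squeeze(0).squeeze(0)   # (d_model,)
```

Because the references are concatenated along the time axis before pooling,
adding more (or longer) prompts simply enlarges the attention memory, which
matches the paper's claim of supporting arbitrary-length prompts.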
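The abstract also states that arbitrary-source prompts "leverage the
probabilities derived from multiple P-LLM outputs." One natural reading is a
weighted mixture of the next-token distributions produced by separate P-LLM
passes, each conditioned on a different prompt source. The sketch below shows
that fusion step under this assumption; `plm`, the prompt variables, and the
weights are placeholders, not the paper's actual interface.

```python
import torch

def fuse_plm_outputs(logits_per_source: list[torch.Tensor],
                     weights: list[float]) -> torch.Tensor:
    """Hypothetical sketch of arbitrary-source prompting: combine the
    next-token distributions from several P-LLM passes (each conditioned
    on a different prompt source) into one distribution for sampling."""
    probs = torch.zeros_like(logits_per_source[0])
    for logits, w in zip(logits_per_source, weights):
        probs += w * torch.softmax(logits, dim=-1)
    return probs / sum(weights)

# Usage (placeholders for the paper's P-LLM setup):
# logits_a = plm(prompt_from_source_a)
# logits_b = plm(prompt_from_source_b)
# fused = fuse_plm_outputs([logits_a, logits_b], weights=[0.7, 0.3])
# next_prosody_token = torch.multinomial(fused, num_samples=1)
```

Adjusting the per-source weights is what would make the generated prosody
controllable: shifting weight toward one source biases sampling toward that
source's prosodic style.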