Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts

Abstract

Zero-shot text-to-speech aims to synthesize voices from unseen speech prompts. Previous large-scale multispeaker TTS models have achieved this goal with an enrolled recording of up to 10 seconds. However, most of them are designed to use only short speech prompts, and the limited information in a short prompt significantly hinders fine-grained identity imitation. In this paper, we introduce Mega-TTS 2, a generic zero-shot multispeaker TTS model capable of synthesizing speech for unseen speakers with arbitrary-length prompts. Specifically, we 1) design a multi-reference timbre encoder to extract timbre information from multiple reference utterances, and 2) train a prosody language model with arbitrary-length speech prompts. With these designs, our model is suitable for prompts of different lengths, which extends the upper bound of speech quality for zero-shot text-to-speech. Besides arbitrary-length prompts, we introduce arbitrary-source prompts, which leverage the probabilities derived from multiple P-LLM outputs to produce expressive and controlled prosody. Furthermore, we propose a phoneme-level auto-regressive duration model to introduce in-context learning capabilities to duration modeling. Experiments demonstrate that our method can not only synthesize identity-preserving speech from a short prompt of an unseen speaker but also achieve improved performance with longer speech prompts. Audio samples can be found at https://mega-tts.github.io/mega2_demo/.
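
To make the arbitrary-source prompt idea more concrete, the following is a minimal sketch of one way probabilities from several P-LLM outputs could be combined before sampling a prosody code. The abstract does not give the combination rule, so the weighted average used here is an assumption, and the names `interpolate_distributions`, `sample_code`, and the toy four-code vocabulary are hypothetical illustrations rather than the authors' implementation.

```python
import random


def interpolate_distributions(dists, weights):
    """Blend several next-code probability distributions into one.

    dists   -- list of equal-length lists, each a probability distribution
    weights -- non-negative mixing weights, one per distribution
    NOTE: a weighted average is an assumed combination rule, not one
    stated in the abstract.
    """
    if not dists or len(dists) != len(weights):
        raise ValueError("need at least one distribution and one weight per distribution")
    size = len(dists[0])
    if any(len(d) != size for d in dists):
        raise ValueError("all distributions must cover the same codes")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    mixed = [0.0] * size
    for dist, weight in zip(dists, weights):
        for i, p in enumerate(dist):
            # Normalizing by the weight total keeps the result a distribution.
            mixed[i] += (weight / total) * p
    return mixed


def sample_code(mixed, rng):
    """Draw one code index according to the blended distribution."""
    return rng.choices(range(len(mixed)), weights=mixed, k=1)[0]


# Hypothetical distributions over four prosody codes, as might be produced
# by decoding with prompts from two different sources.
p_source_a = [0.7, 0.1, 0.1, 0.1]
p_source_b = [0.1, 0.1, 0.1, 0.7]

mixed = interpolate_distributions([p_source_a, p_source_b], [1.0, 1.0])
code = sample_code(mixed, random.Random(0))
```

Under this assumed rule, shifting the weights toward one source pulls the sampled prosody toward that source's prompt, which is one plausible reading of "controlled" prosody in the abstract.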
