Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

Abstract

Scaling text-to-speech (TTS) to large, in-the-wild datasets has proven highly effective for generalizing timbre and speaking style, particularly in zero-shot TTS. However, previous works usually encode speech into latent representations with an audio codec and generate those latents with autoregressive language models or diffusion models, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results. We argue that speech can be decomposed into several attributes (e.g., content, timbre, prosody, and phase) and that each attribute should be modeled by a module with appropriate inductive biases. From this perspective, we carefully design a novel, large zero-shot TTS system called Mega-TTS, which is trained on large-scale wild data and models different attributes in different ways: 1) Instead of using latents encoded by an audio codec as the intermediate feature, we retain the spectrogram, because it separates phase from the other attributes very well. Phase can be appropriately constructed by a GAN-based vocoder and does not need to be modeled by the language model. 2) We model timbre with global vectors, since timbre is a global attribute that changes slowly over time. 3) We use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly within a sentence and language models can capture both local and long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K hours of speech and evaluate its performance on unseen speakers. Experimental results demonstrate that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior naturalness, robustness, and speaker similarity due to the proper inductive bias of each module. Audio samples are available at https://mega-tts.github.io/demo-page.
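
As a rough illustration of the decomposition described in the abstract, the toy sketch below contrasts a global attribute (timbre, summarized as one vector per utterance) with a fast-varying one (prosody, mapped to one discrete code per frame). This is not the authors' implementation: the function names `global_timbre_vector` and `quantize_prosody` are hypothetical, temporal mean pooling stands in for a learned timbre encoder, and a nearest-neighbor codebook lookup stands in for the learned VQGAN quantizer.

```python
from typing import List, Sequence

Frame = List[float]  # one frame of features, e.g. one spectrogram column


def global_timbre_vector(frames: Sequence[Frame]) -> Frame:
    """Average frame-level features over time into one utterance-level vector.

    Assumption: mean pooling is a stand-in for a learned global encoder.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def quantize_prosody(frames: Sequence[Frame], codebook: Sequence[Frame]) -> List[int]:
    """Map each frame to the index of its nearest codebook entry (squared L2).

    Assumption: a fixed toy codebook replaces a codebook learned during training.
    """
    def sq_dist(a: Frame, b: Frame) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in frames
    ]


# Toy data: three 2-dimensional frames and a 2-entry codebook.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
codebook = [[0.0, 0.0], [4.0, 4.0]]

timbre = global_timbre_vector(frames)      # one vector for the whole utterance
codes = quantize_prosody(frames, codebook)  # one discrete code per frame
```

The per-frame code sequence is the kind of discrete input a language model could be trained on, while the pooled vector stays fixed for the utterance; in the system the abstract describes, both components are learned rather than hand-specified.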
