
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

March 5, 2024
Authors: Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao
cs.AI

Abstract

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Because speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate each of them individually. Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models that generates natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model that generates the attributes in each subspace conditioned on the corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms state-of-the-art TTS systems in quality, similarity, prosody, and intelligibility. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
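
To make point 1) concrete, below is a minimal PyTorch sketch of factorized vector quantization: one projection and one codebook per speech attribute, with a straight-through estimator for the discrete lookup. All names and sizes here (FactorizedVQ, dim=256, codebook_size=1024, the attribute labels) are illustrative assumptions, not the paper's implementation; the actual codec additionally wraps this in a full encoder/decoder and disentanglement training not shown here.

```python
# Hypothetical sketch of factorized vector quantization (FVQ); not the
# paper's code. Dimensions and attribute names are illustrative.
import torch
import torch.nn as nn


class FactorizedVQ(nn.Module):
    """Quantize encoder latents into per-attribute subspaces, one codebook each."""

    def __init__(self, dim=256, codebook_size=1024,
                 attributes=("content", "prosody", "timbre", "detail")):
        super().__init__()
        self.attributes = attributes
        # One subspace projection and one codebook per speech attribute.
        self.proj = nn.ModuleDict({a: nn.Linear(dim, dim) for a in attributes})
        self.codebooks = nn.ParameterDict(
            {a: nn.Parameter(torch.randn(codebook_size, dim)) for a in attributes}
        )

    def forward(self, z):
        """z: (batch, frames, dim) latents from the codec encoder."""
        quantized, indices = {}, {}
        for a in self.attributes:
            h = self.proj[a](z)            # project into the attribute subspace
            cb = self.codebooks[a]         # (codebook_size, dim)
            # Distance from every frame to every code: (batch, frames, codebook_size).
            dists = torch.cdist(h, cb.expand(z.size(0), -1, -1))
            idx = dists.argmin(dim=-1)     # nearest code index per frame
            q = cb[idx]                    # (batch, frames, dim)
            # Straight-through estimator: gradients bypass the discrete lookup.
            quantized[a] = h + (q - h).detach()
            indices[a] = idx
        return quantized, indices


# Usage: two utterances of 120 frames each.
fvq = FactorizedVQ()
z = torch.randn(2, 120, 256)
quantized, indices = fvq(z)
print(indices["prosody"].shape)  # torch.Size([2, 120])
```

The factorized diffusion model of point 2) would then generate the discrete tokens (or latents) of each subspace conditioned on the corresponding attribute of the prompt, which is what enables the divide-and-conquer generation the abstract describes.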
