Pheme: Efficient and Conversational Speech Generation
January 5, 2024
Authors: Paweł Budzianowski, Taras Sereda, Tomasz Cichy, Ivan Vulić
cs.AI
Abstract
In recent years, speech generation has seen remarkable progress, now
achieving one-shot generation capability that is often virtually
indistinguishable from a real human voice. Integrating such advancements in
speech generation with large language models might revolutionize a wide range
of applications. However, certain applications, such as assistive
conversational systems, require natural and conversational speech generation
tools that also operate efficiently in real time. Current state-of-the-art
models like VALL-E and SoundStorm, powered by hierarchical neural audio codecs,
require large neural components and extensive training data to work well. In
contrast, MQTTS aims to build more compact conversational TTS models while
capitalizing on smaller-scale real-life conversational speech data. However,
its autoregressive nature yields high inference latency and thus limits its
real-time usage. In order to mitigate the current limitations of the
state-of-the-art TTS models while capitalizing on their strengths, in this work
we introduce the Pheme model series, which 1) offers compact yet
high-performing models, 2) allows for parallel speech generation, 3) produces
natural conversational speech, and 4) can be trained efficiently on
smaller-scale conversational data, cutting data demands by more than 10x
while still matching the quality of autoregressive TTS models. We also show
that, through simple teacher-student distillation, we can achieve significant
improvements in voice quality for
single-speaker setups on top of pretrained Pheme checkpoints, relying solely on
synthetic speech generated by much larger teacher models. Audio samples and
pretrained models are available online.
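
The abstract contrasts slow autoregressive decoding (one forward pass per token, as in MQTTS) with parallel generation (as in SoundStorm). Below is a minimal sketch of the MaskGIT-style iterative scheme that SoundStorm-like parallel decoders are built on: all codec-token positions start masked, every step fills them all in at once, and the least confident predictions are re-masked on a cosine schedule. The `model(cond, tokens)` signature, `mask_id`, and step count are illustrative assumptions, not Pheme's actual decoder interface.

```python
import math
import torch

def parallel_decode(model, cond, seq_len, mask_id, num_steps=8):
    """MaskGIT-style iterative parallel decoding over a token sequence.

    All positions start masked; each step predicts every masked position
    at once, then re-masks the least confident predictions following a
    cosine schedule, so the sequence emerges in `num_steps` forward
    passes instead of `seq_len` autoregressive ones.
    """
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(1, num_steps + 1):
        logits = model(cond, tokens)               # (seq_len, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        masked = tokens == mask_id
        tokens[masked] = pred[masked]              # tentatively fill all gaps
        if step == num_steps:
            break                                  # final step: commit everything
        # Decreasing cosine schedule for how many positions stay masked.
        n_remask = int(math.cos(math.pi / 2 * step / num_steps) * seq_len)
        conf[~masked] = float("inf")               # never re-mask committed tokens
        tokens[conf.topk(n_remask, largest=False).indices] = mask_id
    return tokens

# Toy stand-in for the conditional token predictor, just to exercise the loop.
toy_model = lambda cond, toks: torch.randn(toks.shape[0], 1024)
codes = parallel_decode(toy_model, cond=None, seq_len=400, mask_id=1024)
```

The practical payoff is the latency profile: generation cost scales with the fixed number of refinement steps rather than with utterance length, which is what makes real-time conversational use feasible.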
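
The single-speaker teacher-student recipe described in the abstract reduces to two stages: a much larger teacher TTS synthesizes speech for a transcript corpus in the target voice, and the pretrained Pheme checkpoint is then fine-tuned on those purely synthetic (text, audio) pairs. The sketch below captures that pipeline shape; `teacher.synthesize` and `student.train_step` are hypothetical placeholder methods, not the released Pheme API.

```python
from pathlib import Path

def distill(teacher, student, transcripts, out_dir, epochs=10):
    """Fine-tune a pretrained student TTS purely on teacher-generated speech."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # 1) Teacher pass: synthesize one waveform per transcript in the target voice.
    pairs = []
    for i, text in enumerate(transcripts):
        wav_path = out_dir / f"{i:06d}.wav"
        teacher.synthesize(text, save_to=wav_path)   # hypothetical teacher API
        pairs.append((text, wav_path))

    # 2) Student pass: ordinary supervised fine-tuning on the synthetic pairs;
    #    no ground-truth recordings of the target speaker are required.
    for _ in range(epochs):
        for text, wav_path in pairs:
            student.train_step(text, wav_path)       # hypothetical student API
    return student
```

Because the fine-tuning corpus is entirely teacher-generated, the student inherits the teacher's voice quality for that speaker while keeping its own compact, parallel-decoding architecture.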