Pheme: Efficient and Conversational Speech Generation
January 5, 2024
Authors: Paweł Budzianowski, Taras Sereda, Tomasz Cichy, Ivan Vulić
cs.AI
Abstract
In recent years, speech generation has seen remarkable progress, now
achieving one-shot generation capability that is often virtually
indistinguishable from a real human voice. Integrating such advancements in
speech generation with large language models might revolutionize a wide range
of applications. However, certain applications, such as assistive
conversational systems, require natural and conversational speech generation
tools that also operate efficiently in real time. Current state-of-the-art
models like VALL-E and SoundStorm, powered by hierarchical neural audio codecs,
require large neural components and extensive training data to work well. In
contrast, MQTTS aims to build more compact conversational TTS models while
capitalizing on smaller-scale real-life conversational speech data. However,
its autoregressive nature yields high inference latency and thus limits its
real-time usage. To mitigate the current limitations of state-of-the-art
TTS models while building on their strengths, in this work we introduce the
Pheme model series, which 1) offers compact yet high-performing models,
2) allows for parallel generation of 3) natural conversational speech, and
4) can be trained efficiently on smaller-scale conversational data, cutting
data demands by more than 10x while still matching the quality of
autoregressive TTS models. We also show that, through simple teacher-student
distillation, we can achieve significant improvements in voice quality for
single-speaker setups on top of pretrained Pheme checkpoints, relying solely on
synthetic speech generated by much larger teacher models. Audio samples and
pretrained models are available online.
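
To make the distillation recipe concrete, below is a minimal, hypothetical
sketch of single-speaker teacher-student distillation as described in the
abstract: a large frozen teacher model synthesizes a speech corpus for one
target speaker, and a pretrained compact student is fine-tuned on that
synthetic data alone. The teacher and student interfaces (synthesize, calling
the student on text) are illustrative assumptions, not the actual Pheme or
teacher-model APIs.

    # Illustrative teacher-student distillation for a single-speaker setup.
    # The teacher/student interfaces are assumptions for this sketch only.
    import torch
    import torch.nn.functional as F

    def build_synthetic_corpus(teacher, texts, speaker_prompt):
        """Run the frozen teacher once to create (text, audio-token) pairs."""
        corpus = []
        with torch.no_grad():
            for text in texts:
                # Assumed API: returns a 1-D LongTensor of discrete
                # audio-codec tokens for the target speaker.
                audio_tokens = teacher.synthesize(text, speaker_prompt)
                corpus.append((text, audio_tokens))
        return corpus

    def distill(student, teacher, texts, speaker_prompt, epochs=10, lr=1e-4):
        """Fine-tune a pretrained student checkpoint on synthetic speech."""
        corpus = build_synthetic_corpus(teacher, texts, speaker_prompt)
        optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
        student.train()
        for _ in range(epochs):
            for text, target_tokens in corpus:
                optimizer.zero_grad()
                # Assumed API: student predicts logits over the audio-token
                # vocabulary, shape (seq_len, vocab_size).
                logits = student(text, speaker_prompt)
                loss = F.cross_entropy(logits, target_tokens)
                loss.backward()
                optimizer.step()
        return student

Because the student only ever sees teacher-generated audio, no additional
recordings of the target speaker are needed beyond the original prompt, which
is what makes this recipe attractive for single-speaker adaptation.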