Pheme: Efficient and Conversational Speech Generation

Abstract

In recent years, speech generation has seen remarkable progress, now achieving one-shot generation capability that is often virtually indistinguishable from a real human voice. Integrating such advances in speech generation with large language models might revolutionize a wide range of applications. However, certain applications, such as assistive conversational systems, require natural and conversational speech generation tools that also operate efficiently in real time. Current state-of-the-art models like VALL-E and SoundStorm, powered by hierarchical neural audio codecs, require large neural components and extensive training data to work well. In contrast, MQTTS aims to build more compact conversational TTS models while capitalizing on smaller-scale real-life conversational speech data. However, its autoregressive nature yields high inference latency and thus limits its real-time usage. To mitigate the current limitations of state-of-the-art TTS models while capitalizing on their strengths, in this work we introduce the Pheme model series, which 1) offers compact yet high-performing models, 2) allows for parallel speech generation, 3) produces natural conversational speech, and 4) can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of autoregressive TTS models. We also show that, through simple teacher-student distillation, we can achieve significant improvements in voice quality for single-speaker setups on top of pretrained Pheme checkpoints, relying solely on synthetic speech generated by much larger teacher models. Audio samples and pretrained models are available online.
