Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Abstract

Recent advances in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech because they rely on audiobook datasets limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline that extracts high-quality training data from valuable yet underexplored in-the-wild data capturing spontaneous human speech in real-world contexts. By leveraging Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. In addition, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, and that it better captures the diverse speaker timbres and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.
