Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Abstract

Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech because they rely on audiobook datasets limited to formal, read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline that extracts high-quality training data from valuable yet underexplored in-the-wild data capturing spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. The dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. We further expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, and that it better captures the diverse speaker timbres and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.