SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis

Abstract

Speech synthesis has advanced rapidly in recent years, yet open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese, and consistently achieves state-of-the-art synthesis quality across these languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict separation of training and test data, which facilitates systematic assessment in zero-shot settings.
