SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis
February 8, 2026
Authors: Jiale Qian, Hao Meng, Tian Zheng, Pengcheng Zhu, Haopeng Lin, Yuhang Dai, Hanke Xie, Wenxiao Cao, Ruixuan Shang, Jun Wu, Hongmei Liu, Hanlin Wen, Jian Zhao, Zhonglin Jiang, Yong Chen, Shunshun Yin, Ming Tao, Jianguo Wei, Lei Xie, Xinsheng Wang
cs.AI
Abstract
While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.