SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis

February 8, 2026
Authors: Jiale Qian, Hao Meng, Tian Zheng, Pengcheng Zhu, Haopeng Lin, Yuhang Dai, Hanke Xie, Wenxiao Cao, Ruixuan Shang, Jun Wu, Hongmei Liu, Hanlin Wen, Jian Zhao, Zhonglin Jiang, Yong Chen, Shunshun Yin, Ming Tao, Jianguo Wei, Lei Xie, Xinsheng Wang
cs.AI

Abstract

While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict separation between training and test data, enabling systematic assessment in zero-shot settings.
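The abstract does not state exactly how SoulX-Singer-Eval enforces its training-test separation. As a rough illustration only, the sketch below assumes that "zero-shot" means no test singer appears in the training data, and additionally checks song overlap as an extra assumption. The `Clip` record, its `singer_id`/`song_id` fields, and the functions `overlap_report` and `is_zero_shot_split` are hypothetical names, not part of the SoulX-Singer release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """One audio clip. singer_id and song_id are hypothetical metadata fields."""
    clip_id: str
    singer_id: str
    song_id: str


def overlap_report(train: list[Clip], test: list[Clip]) -> dict[str, set[str]]:
    """Return the singers and songs that occur in both splits (empty sets mean disjoint)."""
    train_singers = {c.singer_id for c in train}
    train_songs = {c.song_id for c in train}
    return {
        "singers": {c.singer_id for c in test} & train_singers,
        "songs": {c.song_id for c in test} & train_songs,
    }


def is_zero_shot_split(train: list[Clip], test: list[Clip]) -> bool:
    """Assumed criterion: no singer and no song is shared between train and test."""
    report = overlap_report(train, test)
    return not report["singers"] and not report["songs"]


if __name__ == "__main__":
    train = [Clip("a1", "singer_A", "song_1"), Clip("b1", "singer_B", "song_2")]
    test_ok = [Clip("c1", "singer_C", "song_3")]
    test_leaky = [Clip("c2", "singer_A", "song_4")]
    print(is_zero_shot_split(train, test_ok))   # True
    print(overlap_report(train, test_leaky))    # {'singers': {'singer_A'}, 'songs': set()}
```

Reporting the overlapping identifiers, rather than returning only a boolean, makes it easy to see which singers or songs would leak into a supposedly zero-shot test set. A real benchmark may use stricter or different criteria.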
February 11, 2026