SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis

Abstract

Speech synthesis has progressed rapidly in recent years, yet open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese, and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.
