SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis

Abstract

While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese, and it consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.
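
As a rough illustration of what a strict separation between training and test data (training-test disentanglement) can mean in practice, the sketch below checks that no singer or song identifier appears in both splits. The record layout, the field names `singer_id` and `song_id`, and the helper `find_leakage` are hypothetical; the report does not state how SoulX-Singer-Eval enforces the separation.

```python
from __future__ import annotations

from typing import Iterable, Mapping


def find_leakage(
    train: Iterable[Mapping[str, str]],
    test: Iterable[Mapping[str, str]],
    keys: tuple[str, ...] = ("singer_id", "song_id"),  # hypothetical field names
) -> dict[str, set[str]]:
    """Return, for each key, the values that occur in both splits."""
    train_rows = list(train)
    test_rows = list(test)
    leaks: dict[str, set[str]] = {}
    for key in keys:
        train_values = {row[key] for row in train_rows}
        test_values = {row[key] for row in test_rows}
        # Any shared value means the test split is not fully held out for this key.
        leaks[key] = train_values & test_values
    return leaks


if __name__ == "__main__":
    # Toy metadata, invented purely for illustration.
    train_meta = [
        {"singer_id": "s01", "song_id": "a"},
        {"singer_id": "s02", "song_id": "b"},
    ]
    test_meta = [
        {"singer_id": "s02", "song_id": "c"},  # singer s02 also appears in training
        {"singer_id": "s03", "song_id": "d"},
    ]
    print(find_leakage(train_meta, test_meta))
```

Under this reading, a split counts as zero-shot only if every returned set is empty, so that neither the voices nor the songs in the test data were seen during training.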
