SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue
May 29, 2026
Authors: Ruiqi Li, Yu Zhang, Changhao Pan, Ke Lei, Xiang Yin, Cheng Yang
cs.AI
Abstract
Zero-shot text-to-speech (TTS) has improved substantially for single-speaker synthesis, yet expressive long-form multi-speaker dialogue remains difficult. A common workaround is to synthesize each turn with a monologue TTS model and stitch the outputs together. This adds inference cost and often breaks acoustic consistency, conversational coherence, and affective continuity across turns. Recent dialogue TTS systems have begun to address this setting, but they still struggle to keep expressive coherence, controllable speaker switching, and monologue quality at the same time. We present SwanData-Speech and SwanVoice. SwanData-Speech builds monologue and dialogue corpora from in-the-wild audio, using Swan Forced Aligner for pause-aware word-level alignment and RobustMegaTTS3 for pronunciation-hard cases. Built on these data, SwanVoice is a zero-shot TTS model for 1 to 4 speakers, combining a 25 Hz VAE, raw-text conditioning with pause-aware symbols and pinyin substitution, and a flow-matching DiT with speaker-turn conditioning. Training starts from monologue speech, moves through mixed and real dialogue data, and then uses DiffusionNFT post-training with phone-level and speaker-similarity rewards. On SwanBench-Speech, SwanVoice obtains higher richness and hierarchy scores than all evaluated open-source baselines in both monologue and dialogue settings, while content accuracy remains the main limitation. Audio demos are available at https://swanaigc.github.io//#swanvoice.