SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Monologue and Dialogue

Abstract

Zero-shot text-to-speech (TTS) has improved substantially for single-speaker synthesis, yet expressive long-form multi-speaker dialogue remains difficult. A common workaround is to synthesize each turn with a monologue TTS model and stitch the outputs together. This adds inference cost and often breaks acoustic consistency, conversational coherence, and affective continuity across turns. Recent dialogue TTS systems have begun to address this setting, but they still struggle to maintain expressive coherence, controllable speaker switching, and monologue quality at the same time. We present SwanData-Speech and SwanVoice. SwanData-Speech builds monologue and dialogue corpora from in-the-wild audio, using Swan Forced Aligner for pause-aware word-level alignment and RobustMegaTTS3 for hard-to-pronounce cases. Built on these data, SwanVoice is a zero-shot TTS model for 1 to 4 speakers that combines a 25 Hz VAE, raw-text conditioning with pause-aware symbols and pinyin substitution, and a flow-matching DiT with speaker-turn conditioning. Training starts from monologue speech, proceeds through mixed and real dialogue data, and ends with DiffusionNFT post-training using phone-level and speaker-similarity rewards. On SwanBench-Speech, SwanVoice obtains higher richness and hierarchy scores than all evaluated open-source baselines in both monologue and dialogue settings, while content accuracy remains its main limitation. Audio demos are available at https://swanaigc.github.io//#swanvoice.
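
To give a rough sense of scale for the 25 Hz latent rate and speaker-turn conditioning, the sketch below maps a list of dialogue turns to one speaker index per latent frame. Only the 25 Hz rate and the 1 to 4 speaker limit come from the abstract. The `Turn` type, the `frame_speaker_ids` helper, and the per-frame index representation are hypothetical, introduced purely for illustration; the abstract does not say how SwanVoice actually encodes speaker turns.

```python
from dataclasses import dataclass

LATENT_RATE_HZ = 25  # latent frame rate stated in the abstract
MAX_SPEAKERS = 4     # the abstract states support for 1 to 4 speakers


@dataclass
class Turn:
    speaker: int    # hypothetical 0-based speaker index
    seconds: float  # hypothetical turn duration in seconds


def frame_speaker_ids(turns):
    """Return one speaker index per latent frame for a whole dialogue.

    Assumption for illustration only: speaker-turn conditioning is
    represented as a per-frame speaker index aligned to the latents.
    """
    ids = []
    elapsed = 0.0
    for turn in turns:
        if not 0 <= turn.speaker < MAX_SPEAKERS:
            raise ValueError(f"speaker index {turn.speaker} out of range")
        elapsed += turn.seconds
        # Round on cumulative time so per-turn rounding errors do not pile up.
        target = round(elapsed * LATENT_RATE_HZ)
        ids.extend([turn.speaker] * (target - len(ids)))
    return ids


if __name__ == "__main__":
    dialogue = [Turn(0, 2.0), Turn(1, 1.0), Turn(0, 3.0)]
    ids = frame_speaker_ids(dialogue)
    print(len(ids))  # 6 seconds at 25 Hz gives 150 latent frames
```

Under these assumptions, a whole dialogue is a single frame sequence with speaker changes marked inside it, in contrast to the stitching workaround, where each turn is synthesized as a separate sequence and concatenated afterwards.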