SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

Abstract

Zero-shot text-to-speech (TTS) has improved substantially for single-speaker synthesis, yet expressive long-form multi-speaker dialogue remains difficult. A common workaround is to synthesize each turn with a monologue TTS model and stitch the outputs together. This adds inference cost and often breaks acoustic consistency, conversational coherence, and affective continuity across turns. Recent dialogue TTS systems have begun to address this setting, but they still struggle to maintain expressive coherence, controllable speaker switching, and monologue quality simultaneously. We present SwanData-Speech and SwanVoice. SwanData-Speech builds monologue and dialogue corpora from in-the-wild audio, using Swan Forced Aligner for pause-aware word-level alignment and RobustMegaTTS3 for cases that are hard to pronounce. Built on these data, SwanVoice is a zero-shot TTS model for 1 to 4 speakers, combining a 25 Hz VAE, raw-text conditioning with pause-aware symbols and pinyin substitution, and a flow-matching DiT with speaker-turn conditioning. Training starts from monologue speech, proceeds through mixed and real dialogue data, and then applies DiffusionNFT post-training with phone-level and speaker-similarity rewards. On SwanBench-Speech, SwanVoice obtains higher richness and hierarchy scores than all evaluated open-source baselines in both monologue and dialogue settings, while content accuracy remains the main limitation. Audio demos are available at https://swanaigc.github.io//#swanvoice.
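
As a rough illustration of what the 25 Hz latent rate means for long-form synthesis, the minimal sketch below converts a clip duration into the number of latent frames the generator would have to produce. Only the 25 Hz figure comes from the abstract; the function name `latent_frames` and the choice to round up to a whole frame are assumptions made for this example, not details of SwanVoice.

```python
import math

# 25 Hz is the VAE frame rate stated in the abstract.
LATENT_RATE_HZ = 25


def latent_frames(duration_seconds: float, rate_hz: int = LATENT_RATE_HZ) -> int:
    """Return the number of latent frames needed to cover a clip.

    Hypothetical helper: rounding up to a whole frame is an assumption
    for this sketch, not a documented SwanVoice behaviour.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return math.ceil(duration_seconds * rate_hz)


if __name__ == "__main__":
    # Print frame counts for a short turn, a one-minute clip, and a ten-minute dialogue.
    for seconds in (4.0, 60.0, 600.0):
        print(f"{seconds:>6.1f} s -> {latent_frames(seconds)} latent frames")
```

Under these assumptions, a ten-minute dialogue corresponds to 15,000 latent frames, which gives a sense of the sequence lengths involved in long-form generation.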