SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

Abstract

Zero-shot text-to-speech (TTS) has improved substantially for single-speaker synthesis, yet expressive long-form multi-speaker dialogue remains difficult. A common workaround is to synthesize each turn with a monologue TTS model and stitch the outputs together. This increases inference cost and often breaks acoustic consistency, conversational coherence, and affective continuity across turns. Recent dialogue TTS systems have begun to address this setting, but they still struggle to maintain expressive coherence, controllable speaker switching, and monologue quality simultaneously. We present SwanData-Speech and SwanVoice. SwanData-Speech builds monologue and dialogue corpora from in-the-wild audio, using Swan Forced Aligner for pause-aware word-level alignment and RobustMegaTTS3 for cases with difficult pronunciation. Built on these data, SwanVoice is a zero-shot TTS model for 1 to 4 speakers that combines a 25 Hz VAE, raw-text conditioning with pause-aware symbols and pinyin substitution, and a flow-matching DiT with speaker-turn conditioning. Training starts from monologue speech, proceeds through mixed and real dialogue data, and ends with DiffusionNFT post-training that uses phone-level and speaker-similarity rewards. On SwanBench-Speech, SwanVoice obtains higher richness and hierarchy scores than all evaluated open-source baselines in both monologue and dialogue settings, while content accuracy remains its main limitation. Audio demos are available at https://swanaigc.github.io//#swanvoice.
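
To make two of the quantities above concrete, the following is a minimal illustrative sketch, not the SwanVoice implementation. It shows how a 25 Hz latent rate translates audio duration into sequence length, and how a generic flow-matching training pair is formed. The straight-line interpolation path, the function names (`latent_frames`, `flow_matching_pair`, `fm_loss`), and the use of plain Python lists are all assumptions made for illustration; the abstract does not state which probability path or loss weighting the model uses.

```python
import math

LATENT_RATE_HZ = 25  # latent frame rate stated in the abstract


def latent_frames(duration_s: float, rate_hz: int = LATENT_RATE_HZ) -> int:
    """Number of latent frames needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * rate_hz)


def flow_matching_pair(x0, x1, t):
    """Build one training pair for flow matching on a straight-line path.

    x0 is a noise sample, x1 is a data latent, and t lies in [0, 1].
    Assumption: the path is x_t = (1 - t) * x0 + t * x1, whose velocity
    is the constant x1 - x0. The source does not specify the path.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def fm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    n = len(target_velocity)
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)) / n


if __name__ == "__main__":
    # A 10-minute dialogue occupies 600 s * 25 Hz latent frames.
    print(latent_frames(600))
    x_t, v = flow_matching_pair([0.0, 2.0], [1.0, 4.0], 0.5)
    print(x_t, v, fm_loss(v, v))
```

Under these assumptions, a 10-minute dialogue corresponds to 15,000 latent frames, which gives a sense of the sequence lengths a long-form model must handle in a single pass rather than turn by turn.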