Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Abstract

We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. Seed-TTS serves as a foundation model for speech generation and excels in speech in-context learning, achieving performance in speaker similarity and naturalness that matches ground truth human speech in both objective and subjective evaluations. With fine-tuning, we achieve even higher subjective scores across these metrics. Seed-TTS offers superior controllability over various speech attributes such as emotion and is capable of generating highly expressive and diverse speech for speakers in the wild. Furthermore, we propose a self-distillation method for speech factorization, as well as a reinforcement learning approach to enhance model robustness, speaker similarity, and controllability. We additionally present a non-autoregressive (NAR) variant of the Seed-TTS model, named Seed-TTS_DiT, which utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS systems, Seed-TTS_DiT does not depend on pre-estimated phoneme durations and performs speech generation through end-to-end processing. We demonstrate that this variant achieves comparable performance to the language model-based variant and showcase its effectiveness in speech editing. We encourage readers to listen to demos at https://bytedancespeech.github.io/seedtts_tech_report.
