Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
June 4, 2024
Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, Junjie Pan, Xin Wang, Yuping Wang, Yuxuan Wang, Zhen Wei, Jian Wu, Chao Yao, Yifeng Yang, Yuanhao Yi, Junteng Zhang, Qidi Zhang, Shuo Zhang, Wenjie Zhang, Yang Zhang, Zilin Zhao, Dejian Zhong, Xiaobin Zhuang
cs.AI
Abstract
We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech
(TTS) models capable of generating speech that is virtually indistinguishable
from human speech. Seed-TTS serves as a foundation model for speech generation
and excels in speech in-context learning, achieving performance in speaker
similarity and naturalness that matches ground truth human speech in both
objective and subjective evaluations. With fine-tuning, we achieve even higher
subjective scores across these metrics. Seed-TTS offers superior
controllability over various speech attributes such as emotion and is capable
of generating highly expressive and diverse speech for speakers in the wild.
Furthermore, we propose a self-distillation method for speech factorization, as
well as a reinforcement learning approach to enhance model robustness, speaker
similarity, and controllability. We additionally present a non-autoregressive
(NAR) variant of the Seed-TTS model, named Seed-TTS_DiT, which
utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS
systems, Seed-TTS_DiT does not depend on pre-estimated phoneme
durations and performs speech generation through end-to-end processing. We
demonstrate that this variant achieves comparable performance to the language
model-based variant and showcase its effectiveness in speech editing. We
encourage readers to listen to demos at
https://bytedancespeech.github.io/seedtts_tech_report.
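
The abstract gives no implementation details, but the language-model-based variant can be pictured as ordinary autoregressive decoding over discrete speech tokens, with the reference speaker's audio supplied as an in-context prompt so the model continues in the same voice. The sketch below is a minimal illustration under that assumption; `lm`, `eos_id`, and the token layout are hypothetical stand-ins, not the paper's actual interfaces.

```python
import torch

@torch.no_grad()
def generate_speech_tokens(
    lm,                           # hypothetical autoregressive transformer over tokens
    text_ids: torch.Tensor,       # tokenized input text, shape (1, T_text)
    prompt_tokens: torch.Tensor,  # speech tokens from a reference utterance, (1, T_prompt)
    eos_id: int,
    max_new_tokens: int = 2000,
    temperature: float = 0.8,
) -> torch.Tensor:
    """In-context learning: the reference speaker's speech tokens are
    prepended as a prompt, so the LM continues in that speaker's voice."""
    seq = torch.cat([text_ids, prompt_tokens], dim=1)
    for _ in range(max_new_tokens):
        logits = lm(seq)[:, -1, :]                      # next-token distribution
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        if next_tok.item() == eos_id:                   # stop at end-of-speech
            break
        seq = torch.cat([seq, next_tok], dim=1)
    # return only the newly generated speech tokens
    return seq[:, text_ids.shape[1] + prompt_tokens.shape[1]:]
```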
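
Similarly, Seed-TTS_DiT is described as fully diffusion-based and free of pre-estimated phoneme durations. One generic way to picture this is an iterative sampler that refines an entire latent sequence at once, with the output length chosen end to end rather than summed from per-phoneme durations. The Euler-style update below is a common flow-matching-family sampler used purely as a stand-in; `velocity_net` and all shapes are assumptions, not the paper's actual design.

```python
import torch

@torch.no_grad()
def sample_acoustic_latents(
    velocity_net,            # hypothetical DiT-style net: (x_t, t, text_emb) -> velocity
    text_emb: torch.Tensor,  # encoded text condition, shape (1, T_text, D)
    latent_len: int,         # total output length, set up front end to end,
                             # not assembled from per-phoneme durations
    latent_dim: int = 128,
    num_steps: int = 50,
) -> torch.Tensor:
    """Euler integration from Gaussian noise to acoustic latents."""
    x = torch.randn(1, latent_len, latent_dim)   # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)             # current time in [0, 1)
        x = x + dt * velocity_net(x, t, text_emb)
    return x  # rendered to a waveform by a separate vocoder in practice
```

Because the whole sequence is refined in parallel, any span of the latents can be re-noised and re-sampled while the rest is held fixed, which is one plausible reading of why the authors highlight this variant's effectiveness for speech editing.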