Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
June 4, 2024
Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, Junjie Pan, Xin Wang, Yuping Wang, Yuxuan Wang, Zhen Wei, Jian Wu, Chao Yao, Yifeng Yang, Yuanhao Yi, Junteng Zhang, Qidi Zhang, Shuo Zhang, Wenjie Zhang, Yang Zhang, Zilin Zhao, Dejian Zhong, Xiaobin Zhuang
cs.AI
Abstract
We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech
(TTS) models capable of generating speech that is virtually indistinguishable
from human speech. Seed-TTS serves as a foundation model for speech generation
and excels in speech in-context learning, achieving performance in speaker
similarity and naturalness that matches ground truth human speech in both
objective and subjective evaluations. With fine-tuning, we achieve even higher
subjective scores across these metrics. Seed-TTS offers superior
controllability over various speech attributes such as emotion and is capable
of generating highly expressive and diverse speech for speakers in the wild.
Furthermore, we propose a self-distillation method for speech factorization, as
well as a reinforcement learning approach to enhance model robustness, speaker
similarity, and controllability. We additionally present a non-autoregressive
(NAR) variant of the Seed-TTS model, named Seed-TTS_DiT, which
utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS
systems, Seed-TTS_DiT does not depend on pre-estimated phoneme
durations and performs speech generation through end-to-end processing. We
demonstrate that this variant achieves comparable performance to the language
model-based variant and showcase its effectiveness in speech editing. We
encourage readers to listen to demos at
https://bytedancespeech.github.io/seedtts_tech_report.
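
The abstract gives no implementation details, but the language-model-based variant can be pictured as ordinary autoregressive decoding over discrete speech tokens, with the reference speaker's audio supplied as an in-context prompt so the model continues in the same voice. The sketch below is a minimal illustration under that assumption; `lm`, `eos_id`, and the token layout are hypothetical stand-ins, not the paper's actual interfaces.

```python
import torch

@torch.no_grad()
def generate_speech_tokens(
    lm,                           # hypothetical autoregressive transformer over tokens
    text_ids: torch.Tensor,       # tokenized input text, shape (1, T_text)
    prompt_tokens: torch.Tensor,  # speech tokens from a reference utterance, (1, T_prompt)
    eos_id: int,
    max_new_tokens: int = 2000,
    temperature: float = 0.8,
) -> torch.Tensor:
    """In-context learning: the reference speaker's speech tokens are
    prepended as a prompt, so the LM continues in that speaker's voice."""
    seq = torch.cat([text_ids, prompt_tokens], dim=1)
    for _ in range(max_new_tokens):
        logits = lm(seq)[:, -1, :]                      # next-token distribution
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        if next_tok.item() == eos_id:                   # stop at end-of-speech
            break
        seq = torch.cat([seq, next_tok], dim=1)
    # return only the newly generated speech tokens
    return seq[:, text_ids.shape[1] + prompt_tokens.shape[1]:]
```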
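
Similarly, Seed-TTS_DiT is described as fully diffusion-based and free of pre-estimated phoneme durations. One generic way to picture this is an iterative sampler that refines an entire latent sequence at once, with the output length chosen end to end rather than summed from per-phoneme durations. The Euler-style update below is a common flow-matching-family sampler used purely as a stand-in; `velocity_net` and all shapes are assumptions, not the paper's actual design.

```python
import torch

@torch.no_grad()
def sample_acoustic_latents(
    velocity_net,            # hypothetical DiT-style net: (x_t, t, text_emb) -> velocity
    text_emb: torch.Tensor,  # encoded text condition, shape (1, T_text, D)
    latent_len: int,         # total output length, set up front end to end,
                             # not assembled from per-phoneme durations
    latent_dim: int = 128,
    num_steps: int = 50,
) -> torch.Tensor:
    """Euler integration from Gaussian noise to acoustic latents."""
    x = torch.randn(1, latent_len, latent_dim)   # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)             # current time in [0, 1)
        x = x + dt * velocity_net(x, t, text_emb)
    return x  # rendered to a waveform by a separate vocoder in practice
```

Because the whole sequence is refined in parallel, any span of the latents can be re-noised and re-sampled while the rest is held fixed, which is one plausible reading of why the authors highlight this variant's effectiveness for speech editing.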