Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
June 4, 2024
Authors: Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, Junjie Pan, Xin Wang, Yuping Wang, Yuxuan Wang, Zhen Wei, Jian Wu, Chao Yao, Yifeng Yang, Yuanhao Yi, Junteng Zhang, Qidi Zhang, Shuo Zhang, Wenjie Zhang, Yang Zhang, Zilin Zhao, Dejian Zhong, Xiaobin Zhuang
cs.AI
Abstract
We introduce Seed-TTS, a family of large-scale autoregressive text-to-speech
(TTS) models capable of generating speech that is virtually indistinguishable
from human speech. Seed-TTS serves as a foundation model for speech generation
and excels in speech in-context learning, achieving performance in speaker
similarity and naturalness that matches ground truth human speech in both
objective and subjective evaluations. With fine-tuning, we achieve even higher
subjective scores across these metrics. Seed-TTS offers superior
controllability over various speech attributes such as emotion and is capable
of generating highly expressive and diverse speech for speakers in the wild.
Furthermore, we propose a self-distillation method for speech factorization, as
well as a reinforcement learning approach to enhance model robustness, speaker
similarity, and controllability. We additionally present a non-autoregressive
(NAR) variant of the Seed-TTS model, named Seed-TTS_DiT, which
utilizes a fully diffusion-based architecture. Unlike previous NAR-based TTS
systems, Seed-TTS_DiT does not depend on pre-estimated phoneme
durations and performs speech generation through end-to-end processing. We
demonstrate that this variant achieves comparable performance to the language
model-based variant and showcase its effectiveness in speech editing. We
encourage readers to listen to demos at
https://bytedancespeech.github.io/seedtts_tech_report.
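
The language-model variant described in the abstract generates speech autoregressively over discrete speech tokens, and zero-shot voice cloning via in-context learning amounts to continuing a prefix built from the target text plus speech tokens extracted from a short reference clip. Seed-TTS itself is not publicly released, so the sketch below is only a toy PyTorch illustration of that decoding pattern: `ToySpeechLM`, the vocabulary sizes, and all dimensions are hypothetical stand-ins, the weights are random, and the sampled tokens would still need a separately trained token-to-waveform decoder.

```python
# Toy sketch (not Seed-TTS): autoregressive decoding of speech tokens
# conditioned on text tokens and reference-speech tokens (in-context cloning).
import torch
import torch.nn as nn

class ToySpeechLM(nn.Module):
    def __init__(self, vocab=1024, dim=256, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        x = self.embed(tokens)
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.backbone(x, mask=mask))

@torch.no_grad()
def generate(model, prefix, steps=20, temperature=0.8):
    tokens = prefix
    for _ in range(steps):
        logits = model(tokens)[:, -1] / temperature
        nxt = torch.multinomial(logits.softmax(dim=-1), 1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens[:, prefix.size(1):]  # only the newly generated speech tokens

model = ToySpeechLM().eval()
text_tokens = torch.randint(0, 1024, (1, 12))    # tokenized target text
prompt_speech = torch.randint(0, 1024, (1, 30))  # tokens from a reference clip
prefix = torch.cat([text_tokens, prompt_speech], dim=1)
new_tokens = generate(model, prefix)             # would go to a token vocoder
print(new_tokens.shape)  # torch.Size([1, 20])
```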
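Seed-TTS_DiT, by contrast, is described as fully diffusion-based and free of pre-estimated phoneme durations: a total output length is chosen up front and the whole latent sequence is denoised end to end, conditioned on the text. The sketch below illustrates only that shape of computation; the `ToyDenoiser` module, the Euler-style sampler, and the mean-pooled text conditioning are illustrative assumptions, not the paper's architecture.

```python
# Toy sketch (not Seed-TTS_DiT): NAR generation that denoises an entire latent
# of the target length at once, with no per-phoneme duration inputs.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    def __init__(self, dim=128, text_vocab=256):
        super().__init__()
        self.text_embed = nn.Embedding(text_vocab, dim)
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.GELU(),
                                 nn.Linear(dim, dim))

    def forward(self, x, text_ids, t):
        # Crude conditioning: mean-pooled text embedding broadcast over frames.
        cond = self.text_embed(text_ids).mean(1, keepdim=True).expand_as(x)
        tt = torch.full_like(x[..., :1], t)  # scalar noise level per frame
        return self.net(torch.cat([x, cond, tt], dim=-1))

@torch.no_grad()
def sample(model, text_ids, n_frames, steps=10):
    # Euler steps along a simple noise-to-data path over the full sequence.
    x = torch.randn(1, n_frames, 128)
    for i in range(steps):
        t = 1.0 - i / steps
        v = model(x, text_ids, t)  # predicted update toward the data
        x = x + v / steps
    return x                       # would be decoded to a waveform

model = ToyDenoiser()
text_ids = torch.randint(0, 256, (1, 16))
latent = sample(model, text_ids, n_frames=120)  # total length fixed up front
print(latent.shape)  # torch.Size([1, 120, 128])
```

Because every frame is generated jointly rather than left to right, a masked span can be re-denoised in place while its surroundings stay fixed, which is consistent with the abstract's note that this variant is effective for speech editing.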