MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
February 26, 2025
Authors: Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Boyang Zhang, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, Zhou Zhao
cs.AI
Abstract
While recent zero-shot text-to-speech (TTS) models have significantly
improved speech quality and expressiveness, mainstream systems still suffer
from issues related to speech-text alignment modeling: 1) models without
explicit speech-text alignment modeling are less robust, especially on
hard sentences in practical applications; 2) predefined alignment-based models
are constrained in naturalness by their forced alignments. This paper introduces
MegaTTS 3, a TTS system featuring an innovative sparse alignment
algorithm that guides the latent diffusion transformer (DiT). Specifically, we
provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of
alignment without limiting the search space, thereby achieving high
naturalness. Moreover, we employ a multi-condition classifier-free guidance
strategy for accent intensity adjustment and adopt the piecewise rectified flow
technique to accelerate the generation process. Experiments demonstrate that
MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports
highly flexible control over accent intensity. Notably, our system can generate
high-quality one-minute speech with only 8 sampling steps. Audio samples are
available at https://sditdemo.github.io/sditdemo/.
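
To make the guidance and few-step sampling ideas more concrete, the following is a minimal sketch of one common way to combine classifier-free guidance over two conditions (text and a speaker prompt) with a fixed-step Euler sampler. It is an illustration under stated assumptions, not the authors' implementation: the model signature, the weight names `w_text` and `w_spk`, their default values, and `toy_model` are all hypothetical, and the assumption that the speaker-side weight is what would modulate accent intensity is ours. The sketch also does not implement piecewise rectified flow itself, which is a technique for obtaining a model that remains accurate with few steps; it only shows what sampling with 8 steps looks like once such a model is available.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical model signature: model(x, t, text, speaker) -> velocity.
# Passing None for a condition means that condition is dropped.
VelocityFn = Callable[[Vector, float, Optional[str], Optional[str]], Vector]


def guided_velocity(model: VelocityFn, x: Vector, t: float,
                    text: str, speaker: str,
                    w_text: float, w_spk: float) -> Vector:
    """Combine three model evaluations into one guided velocity.

    Generic two-condition guidance (an assumption, not taken from the paper):
    v = v_none + w_text * (v_text - v_none) + w_spk * (v_full - v_text)
    """
    v_none = model(x, t, None, None)      # no conditions
    v_text = model(x, t, text, None)      # text only
    v_full = model(x, t, text, speaker)   # text and speaker prompt
    return [u + w_text * (a - u) + w_spk * (b - a)
            for u, a, b in zip(v_none, v_text, v_full)]


def sample(model: VelocityFn, x0: Vector, text: str, speaker: str,
           steps: int = 8, w_text: float = 2.0, w_spk: float = 1.5) -> Vector:
    """Euler integration from t=0 to t=1 using a fixed number of steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = guided_velocity(model, x, t, text, speaker, w_text, w_spk)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_model(x: Vector, t: float,
              text: Optional[str], speaker: Optional[str]) -> Vector:
    """Stand-in for a trained network: a constant velocity per condition set."""
    base = 0.0 if text is None else 1.0
    shift = 0.0 if speaker is None else 0.5
    return [base + shift for _ in x]


if __name__ == "__main__":
    print(sample(toy_model, [0.0, 1.0], "hello", "prompt"))
```

In this formulation, setting both weights to 1 recovers the fully conditioned prediction, and raising or lowering `w_spk` independently of `w_text` changes how strongly the speaker prompt influences the result. Each sampling step costs three model evaluations, which is one reason a small step count matters in practice.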