MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Abstract

Although recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from problems related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling are less robust, especially on hard sentences in practical applications; 2) models based on predefined alignments are limited in naturalness by the constraints of forced alignment. This paper introduces MegaTTS 3, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.
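The abstract names a multi-condition classifier-free guidance strategy but does not give its formula. As a rough illustration only, the sketch below uses a common two-condition formulation in which each condition receives its own guidance scale. The function name, the choice of text and speaker prompt as the two conditions, and the `text_scale` and `speaker_scale` parameters are all assumptions made for this example, not details confirmed by the text above; how such scales would map to accent intensity is likewise not specified here.

```python
from typing import List

Vector = List[float]


def multi_condition_cfg(
    uncond: Vector,
    text_only: Vector,
    text_and_speaker: Vector,
    text_scale: float,
    speaker_scale: float,
) -> Vector:
    """Combine three model predictions with one guidance scale per condition.

    Hypothetical helper (not taken from the paper):
      uncond           - prediction with both conditions dropped
      text_only        - prediction conditioned on text only
      text_and_speaker - prediction conditioned on text and speaker prompt
    """
    if not (len(uncond) == len(text_only) == len(text_and_speaker)):
        raise ValueError("all predictions must have the same length")
    # Start from the unconditional prediction, then move along the text
    # direction and the speaker direction, each with its own scale.
    return [
        u + text_scale * (t - u) + speaker_scale * (ts - t)
        for u, t, ts in zip(uncond, text_only, text_and_speaker)
    ]
```

With both scales set to 1.0 the result reduces to the fully conditioned prediction, and with `speaker_scale` set to 0.0 and `text_scale` set to 1.0 it reduces to the text-only prediction. Separate scales let one condition be emphasized without touching the other.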

