MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

February 26, 2025
Authors: Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Boyang Zhang, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, Zhou Zhao
cs.AI

Abstract

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces MegaTTS 3, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.
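
To make the guidance and few-step sampling ideas in the abstract more concrete, here is a minimal sketch in pure Python. It is not the authors' implementation. It assumes a common two-condition classifier-free guidance formulation (one scale for the text condition, one for the speaker condition) and a plain Euler integrator over a learned velocity field. The names `guided_velocity`, `euler_sample`, `toy_velocity`, `w_text`, and `w_spk` are hypothetical, and the abstract does not say how MegaTTS 3 maps its guidance scales to accent intensity.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical signature: velocity(x, t, text, speaker) -> vector.
# Passing None for a condition stands in for "condition dropped".
VelocityFn = Callable[[Vector, float, Optional[str], Optional[str]], Vector]


def guided_velocity(v: VelocityFn, x: Vector, t: float, text: str,
                    speaker: str, w_text: float, w_spk: float) -> Vector:
    """Combine three forward passes using two independent guidance scales."""
    v_uncond = v(x, t, None, None)   # no conditions
    v_text = v(x, t, text, None)     # text only
    v_full = v(x, t, text, speaker)  # text + speaker prompt
    return [u + w_text * (a - u) + w_spk * (b - a)
            for u, a, b in zip(v_uncond, v_text, v_full)]


def euler_sample(v: VelocityFn, x0: Vector, text: str, speaker: str,
                 w_text: float, w_spk: float, steps: int = 8) -> Vector:
    """Integrate dx/dt = guided velocity from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        g = guided_velocity(v, x, t, text, speaker, w_text, w_spk)
        x = [xi + dt * gi for xi, gi in zip(x, g)]
    return x


def toy_velocity(x: Vector, t: float, text: Optional[str],
                 speaker: Optional[str]) -> Vector:
    """Stand-in for a trained model: constant fields that are easy to check."""
    base = 0.0
    if text is not None:
        base += 1.0
    if speaker is not None:
        base += 2.0
    return [base for _ in x]


if __name__ == "__main__":
    # With both scales at 1.0 the guided velocity equals the fully
    # conditioned prediction.
    print(euler_sample(toy_velocity, [0.0, 0.0], "hello", "prompt", 1.0, 1.0))
```

Setting both scales to 1.0 recovers the fully conditioned prediction, while changing one scale independently strengthens or weakens only that condition's influence. This is the general mechanism that multi-condition guidance relies on. The default of 8 steps mirrors the step count quoted in the abstract, though the piecewise rectified flow training that makes so few steps viable is not modeled here.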
