MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
February 26, 2025
Authors: Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Boyang Zhang, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, Zhou Zhao
cs.AI
Abstract
While recent zero-shot text-to-speech (TTS) models have significantly
improved speech quality and expressiveness, mainstream systems still suffer
from issues related to speech-text alignment modeling: 1) models without
explicit speech-text alignment modeling are less robust, especially on
hard sentences in practical applications; 2) models based on predefined alignments
are limited by the naturalness constraints of forced alignments. This paper introduces
MegaTTS 3, a TTS system featuring an innovative sparse alignment
algorithm that guides the latent diffusion transformer (DiT). Specifically, we
provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of
alignment without limiting the search space, thereby achieving high
naturalness. Moreover, we employ a multi-condition classifier-free guidance
strategy for accent intensity adjustment and adopt the piecewise rectified flow
technique to accelerate the generation process. Experiments demonstrate that
MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports
highly flexible control over accent intensity. Notably, our system can generate
high-quality one-minute speech with only 8 sampling steps. Audio samples are
available at https://sditdemo.github.io/sditdemo/.
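The abstract mentions a multi-condition classifier-free guidance strategy but does not give its formula. As a rough illustration only, the sketch below shows one common nested form of multi-condition guidance, with one scale per condition. The function names, the `text` and `speaker` condition names, the scale parameters, and the toy model are all assumptions made for this example and are not taken from the paper.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical model signature: (text condition, speaker condition) -> prediction.
# Passing None means the condition is dropped (the "unconditional" branch).
Model = Callable[[Optional[str], Optional[str]], Vector]


def multi_condition_cfg(
    model: Model,
    text: str,
    speaker: str,
    text_scale: float,
    speaker_scale: float,
) -> Vector:
    """Combine three model predictions with one guidance scale per condition.

    This is a generic nested formulation, assumed for illustration:
        out = uncond
              + text_scale    * (text_only - uncond)
              + speaker_scale * (full      - text_only)
    """
    uncond = model(None, None)      # no conditions
    text_only = model(text, None)   # text condition only
    full = model(text, speaker)     # both conditions
    return [
        u + text_scale * (t - u) + speaker_scale * (f - t)
        for u, t, f in zip(uncond, text_only, full)
    ]


def toy_model(text: Optional[str], speaker: Optional[str]) -> Vector:
    """Stand-in for a real network: each condition shifts one coordinate."""
    out = [0.0, 0.0]
    if text is not None:
        out[0] += 1.0
    if speaker is not None:
        out[1] += 2.0
    return out


# With both scales equal to 1, the result equals the fully conditional prediction.
plain = multi_condition_cfg(toy_model, "hello", "spk", 1.0, 1.0)
# Changing one scale strengthens or weakens only that condition's contribution.
guided = multi_condition_cfg(toy_model, "hello", "spk", 2.0, 0.5)
```

In this form, setting both scales to 1 reproduces the fully conditional prediction, and each scale can be varied independently, which is the kind of per-condition control the abstract describes.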