Matcha-TTS: A fast TTS architecture with conditional flow matching

Abstract

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest models on long utterances, and attains the highest mean opinion score in a listening test. Please see https://shivammehta25.github.io/Matcha-TTS/ for audio examples, code, and pre-trained models.
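
The abstract names OT-CFM training and an ODE-based decoder without spelling out either. As a rough illustration only, the sketch below follows the formulation of OT-CFM commonly given in the flow matching literature: a noise sample x0 and a data sample x1 are joined by a near-straight path, a vector field is regressed onto that path's constant velocity, and synthesis integrates the learned field from t = 0 to t = 1 in a few Euler steps. The function names, the value of `SIGMA_MIN`, and the use of plain Python lists are assumptions made for this example. It is not the Matcha-TTS implementation, which conditions the vector field on text-encoder outputs and operates on mel-spectrogram frames.

```python
import random

SIGMA_MIN = 1e-4  # assumed value; the abstract does not state it


def ot_cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return the point on the conditional path at time t and its target velocity."""
    x_t = [(1.0 - (1.0 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def ot_cfm_loss(vector_field, x1, rng, sigma_min=SIGMA_MIN):
    """Single-sample estimate of the OT-CFM regression loss for one data vector x1."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample from N(0, I)
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t, u_t = ot_cfm_pair(x0, x1, t, sigma_min)
    v = vector_field(x_t, t)
    return sum((vi - ui) ** 2 for vi, ui in zip(v, u_t)) / len(x1)


def euler_sample(vector_field, x0, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with n_steps Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        v = vector_field(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because the target velocity along each conditional path is constant in t, a field that matches it exactly is integrated without error by Euler steps of any size. The field learned over the whole data distribution is generally not constant, so this is offered only as intuition for why such a decoder may need fewer synthesis steps, not as a claim about the paper's results.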
