Matcha-TTS: A fast TTS architecture with conditional flow matching

Abstract

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest models on long utterances, and attains the highest mean opinion score in a listening test. Please see https://shivammehta25.github.io/Matcha-TTS/ for audio examples, code, and pre-trained models.
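The abstract names OT-CFM and an ODE-based decoder without spelling out the mechanics, so the following is a minimal toy sketch of the idea rather than the Matcha-TTS implementation. It assumes the commonly used OT-CFM formulation, which the abstract itself does not state: a straight path x_t = (1 - (1 - sigma_min) t) x_0 + t x_1 from noise x_0 to data x_1, with regression target u_t = x_1 - (1 - sigma_min) x_0. All function names, the value of `SIGMA_MIN`, and the three-element toy vector are hypothetical. The closed-form conditional field stands in for a trained network.

```python
import random

SIGMA_MIN = 1e-4  # assumed small terminal noise level (hypothetical value)


def ot_cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t): a point on the straight OT path and its regression target."""
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def conditional_field(x, t, x1, sigma_min=SIGMA_MIN):
    """Closed-form conditional vector field u_t(x | x1) for the OT path.

    A trained decoder would approximate the marginal field instead;
    this toy version assumes a single known target x1.
    """
    denom = 1.0 - (1.0 - sigma_min) * t
    return [(b - (1.0 - sigma_min) * a) / denom for a, b in zip(x, x1)]


def euler_solve(field, x0, n_steps):
    """Integrate dx/dt = field(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, h = list(x0), 1.0 / n_steps
    for i in range(n_steps):
        v = field(x, i * h)
        x = [a + h * b for a, b in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    x1 = [0.5, -1.0, 2.0]                   # toy "data" point standing in for a mel frame
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    for n in (1, 2, 10):
        out = euler_solve(lambda x, t: conditional_field(x, t, x1), x0, n)
        print(n, [round(v, 3) for v in out])
```

In this toy setting the conditional trajectory is a straight line with constant velocity, so Euler integration lands on the same endpoint regardless of step count. A learned marginal field over real speech data is not guaranteed to be this straight, so the sketch only illustrates the intuition for why few synthesis steps can suffice.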
