F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Abstract

This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). It requires no complex components such as a duration model, text encoder, or phoneme alignment: the text input is simply padded with filler tokens to the same length as the input speech, and denoising is then performed to generate speech, an approach first shown to be feasible by E2 TTS. However, the original design of E2 TTS is hard to follow because of its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easier to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow steps can be applied to existing flow-matching-based models without retraining. Our design allows faster training and achieves an inference real-time factor (RTF) of 0.15, a large improvement over state-of-the-art diffusion-based TTS models. Trained on a public 100K-hour multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. Demo samples can be found at https://SWivid.github.io/F5-TTS. We release all code and checkpoints to promote community development.
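The padding step described in the abstract can be illustrated with a minimal sketch. It assumes character-level text tokens and a placeholder filler symbol; the name `FILLER`, the function `pad_text_to_frames`, and the frame count are hypothetical and are not taken from the released F5-TTS code.

```python
FILLER = "<F>"  # hypothetical filler token; the actual symbol is not stated in the abstract


def pad_text_to_frames(chars, num_frames, filler=FILLER):
    """Right-pad a character sequence with filler tokens up to num_frames.

    num_frames stands in for the length of the speech (e.g. mel frame)
    sequence that the text is padded to match.
    """
    chars = list(chars)
    if len(chars) > num_frames:
        raise ValueError("text is longer than the speech frame sequence")
    # Append filler tokens so the text and speech sequences have equal length.
    return chars + [filler] * (num_frames - len(chars))


padded = pad_text_to_frames("hi there", num_frames=12)
print(padded)
```

In this sketch the padded sequence has one entry per speech frame, so no duration model or explicit alignment is needed to combine it with the speech input; how the model then learns the alignment is outside what the abstract specifies.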
