F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Abstract

This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with a Diffusion Transformer (DiT). It requires no complex components such as a duration model, a text encoder, or phoneme alignment. Instead, the text input is simply padded with filler tokens to the same length as the input speech, and speech is then generated by denoising, an approach first shown to be feasible by E2 TTS. However, the original E2 TTS design is hard to follow because of its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easier to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow steps can be applied to existing flow-matching-based models without retraining. Our design allows faster training and achieves an inference real-time factor (RTF) of 0.15, a large improvement over state-of-the-art diffusion-based TTS models. Trained on a public 100K-hour multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching, and efficient speed control. Demo samples can be found at https://SWivid.github.io/F5-TTS. We release all code and checkpoints to promote community development.
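
Two ideas in the abstract can be illustrated with a small sketch: padding the text with filler tokens up to the speech length, and reading an RTF of 0.15. This is not the released F5-TTS code. The filler symbol `<F>`, the character-level tokenization, the helper names, and the example numbers are all assumptions made for illustration; the RTF definition used is the standard one (synthesis time divided by the duration of the generated audio).

```python
# Hypothetical filler symbol; the abstract does not name the actual token.
FILLER = "<F>"


def pad_text_to_speech_length(text: str, num_speech_frames: int) -> list[str]:
    """Pad a character sequence with filler tokens up to the speech length.

    Character-level tokenization is an assumption made for this sketch.
    """
    tokens = list(text)
    if len(tokens) > num_speech_frames:
        raise ValueError("text is longer than the speech sequence")
    return tokens + [FILLER] * (num_speech_frames - len(tokens))


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative values only: 5 characters padded to 12 speech frames.
padded = pad_text_to_speech_length("hello", 12)

# An RTF of 0.15 corresponds to, e.g., 1.5 s of compute for 10 s of audio.
rtf = real_time_factor(1.5, 10.0)
```

Under this reading, the padded text and the speech features have the same sequence length, so no explicit duration model or alignment step is needed to pair them, and an RTF below 1 means synthesis runs faster than real time.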