Incremental FastPitch: Chunk-based High Quality Text to Speech
January 3, 2024
Authors: Muyang Du, Chuan Liu, Junjie Lai
cs.AI
Abstract
Parallel text-to-speech models have been widely applied for real-time speech synthesis, and they offer more controllability and a much faster synthesis process compared with conventional auto-regressive models. Although parallel models have benefits in many aspects, they are naturally unfit for incremental synthesis due to their fully parallel architecture, such as the transformer. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks by improving the architecture with chunk-based FFT blocks, training with receptive-field constrained chunk attention masks, and performing inference with fixed-size past model states. Experimental results show that our proposal can produce speech quality comparable to the parallel FastPitch, with a significantly lower latency that allows even shorter response times for real-time speech applications.
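To make the chunk attention idea concrete, the following is a minimal sketch (not the paper's implementation) of a receptive-field constrained chunk attention mask: each Mel frame may attend only to frames in its own chunk and in a fixed number of preceding chunks. The function name `chunk_attention_mask` and the parameters `chunk_size` and `past_chunks` are illustrative assumptions, not names taken from the paper.

```python
import torch

def chunk_attention_mask(seq_len: int, chunk_size: int, past_chunks: int) -> torch.Tensor:
    """Build a [seq_len, seq_len] boolean mask where each frame attends only to
    frames in its own chunk and in up to `past_chunks` preceding chunks.

    Note: a sketch of the masking idea described in the abstract; the real
    model's mask construction may differ.
    """
    # Chunk index of every frame position.
    chunk_id = torch.arange(seq_len) // chunk_size   # [seq_len]
    q_chunk = chunk_id.unsqueeze(1)                  # [seq_len, 1] query chunk indices
    k_chunk = chunk_id.unsqueeze(0)                  # [1, seq_len] key chunk indices
    # A key position is visible if its chunk is not in the future
    # and lies within the limited lookback window.
    mask = (k_chunk <= q_chunk) & (k_chunk >= q_chunk - past_chunks)
    return mask

# Example: 9 frames, chunks of 3 frames, attend to the current chunk plus 1 past chunk.
mask = chunk_attention_mask(seq_len=9, chunk_size=3, past_chunks=1)
print(mask.int())
```

Under this kind of restricted receptive field, incremental inference only ever needs the key/value states of the last `past_chunks` chunks, which is consistent with the fixed-size past model states mentioned in the abstract.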