Incremental FastPitch: Chunk-based High Quality Text to Speech

Abstract

Parallel text-to-speech models are widely used for real-time speech synthesis. Compared with conventional auto-regressive models, they offer more controllability and a much faster synthesis process. Despite these benefits, their fully parallel architecture, such as the Transformer, makes them inherently ill-suited to incremental synthesis. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks. It achieves this by improving the architecture with chunk-based FFT blocks, training with receptive-field-constrained chunk attention masks, and running inference with fixed-size past model states. Experimental results show that our proposal produces speech quality comparable to that of parallel FastPitch with significantly lower latency, which allows even lower response times for real-time speech applications.
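
As a rough illustration of what a receptive-field-constrained chunk attention mask could look like, the sketch below builds a boolean mask in which each frame attends to every frame in its own chunk and to a fixed number of preceding chunks. The abstract does not specify the mask layout, so the function name `chunk_attention_mask` and the parameters `chunk_size` and `past_chunks` are hypothetical choices made for this example and are not taken from the paper.

```python
def chunk_attention_mask(num_frames, chunk_size, past_chunks):
    """Build a num_frames x num_frames boolean attention mask.

    mask[q][k] is True when query frame q may attend to key frame k.
    Assumption for illustration: a frame sees its whole chunk plus
    `past_chunks` earlier chunks, and never sees future chunks.
    """
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        # First visible frame: start of the oldest chunk still in range.
        first_visible = max(0, q_chunk - past_chunks) * chunk_size
        # One past the last visible frame: end of the query's own chunk.
        last_visible = min(num_frames, (q_chunk + 1) * chunk_size)
        mask.append([first_visible <= k < last_visible for k in range(num_frames)])
    return mask


def show(mask):
    """Render the mask as text, 'x' for visible and '.' for masked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print(show(chunk_attention_mask(num_frames=6, chunk_size=2, past_chunks=1)))
```

Because the visible window never extends beyond a fixed number of past chunks, a model trained under such a mask would only need a bounded amount of cached state at inference time, which is the intuition behind the fixed-size past model states mentioned in the abstract.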
