Incremental FastPitch: Chunk-based High Quality Text to Speech

Abstract

Parallel text-to-speech models have been widely applied to real-time speech synthesis, offering more controllability and a much faster synthesis process than conventional auto-regressive models. Despite these benefits, parallel models are inherently ill-suited to incremental synthesis because of their fully parallel architecture, such as the transformer. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks by improving the architecture with chunk-based FFT blocks, training with receptive-field constrained chunk attention masks, and running inference with fixed-size past model states. Experimental results show that our proposal produces speech quality comparable to parallel FastPitch with significantly lower latency, which allows even lower response time for real-time speech applications.
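To make the idea of a receptive-field constrained chunk attention mask more concrete, here is a minimal sketch in plain Python. The abstract does not specify the exact mask layout, so this assumes one plausible form: each Mel frame may attend to frames in its own chunk and in a fixed number of preceding chunks, and never to future chunks. The function name `chunk_attention_mask` and the parameters `chunk_size` and `past_chunks` are hypothetical and chosen only for illustration.

```python
def chunk_attention_mask(num_frames, chunk_size, past_chunks):
    """Build a boolean attention mask over Mel frames.

    mask[q][k] is True when query frame q may attend to key frame k.
    Assumption (not taken from the paper): a frame sees its own chunk
    plus at most `past_chunks` earlier chunks, and no future chunks.
    """
    # Index of the chunk each frame belongs to.
    chunk_ids = [i // chunk_size for i in range(num_frames)]
    # Allow attention when the key chunk is the query chunk or one of
    # the `past_chunks` chunks immediately before it.
    return [[0 <= q - k <= past_chunks for k in chunk_ids] for q in chunk_ids]


# Six frames, chunks of two frames, one past chunk visible.
mask = chunk_attention_mask(num_frames=6, chunk_size=2, past_chunks=1)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# prints:
# 110000
# 110000
# 111100
# 111100
# 001111
# 001111
```

Because every row has a bounded number of allowed positions, a model trained with a mask of this kind only ever depends on a limited window of past frames, which is consistent with keeping fixed-size past states at inference time.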
