HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis

Abstract

Large language model (LLM)-based speech synthesis has been widely adopted for zero-shot speech synthesis. However, such models require large-scale data and share the limitations of earlier autoregressive speech models, including slow inference and a lack of robustness. This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). We verify that hierarchical speech synthesis frameworks can significantly improve the robustness and expressiveness of synthetic speech. Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech, even in zero-shot scenarios. For text-to-speech, we adopt the text-to-vec framework, which generates a self-supervised speech representation and an F0 representation from text representations and prosody prompts. HierSpeech++ then generates speech from the generated vector, the F0, and a voice prompt. We further introduce a highly efficient speech super-resolution framework that upsamples audio from 16 kHz to 48 kHz. The experimental results demonstrate that a hierarchical variational autoencoder can be a strong zero-shot speech synthesizer, given that it outperforms LLM-based and diffusion-based models. Moreover, we achieve the first zero-shot speech synthesis of human-level quality. Audio samples and source code are available at https://github.com/sh-lee-prml/HierSpeechpp.
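
The three stages named in the abstract (text-to-vec, the speech synthesizer, and 16 kHz to 48 kHz super-resolution) can be pictured as a simple chain of functions. The following is only a rough sketch of that data flow. Every function name, the `Waveform` type, the 320-sample hop, and the stub bodies are illustrative assumptions, not the authors' code or the API of the linked repository. The stubs return silence and repeat samples, where the real system would use learned models.

```python
from dataclasses import dataclass
from typing import List, Tuple

SR_LOW = 16_000   # synthesizer output rate (from the abstract)
SR_HIGH = 48_000  # super-resolution target rate (from the abstract)
HOP = 320         # hypothetical samples per frame at 16 kHz (assumption)


@dataclass
class Waveform:
    samples: List[float]
    sample_rate: int


def text_to_vec(text: str, prosody_prompt: List[float]) -> Tuple[List[float], List[float]]:
    """Stage 1 stub: map text plus a prosody prompt to (semantic vector, F0).

    Assumption: one frame per character, and F0 fixed to the prompt's mean.
    """
    n_frames = max(1, len(text))
    mean_f0 = sum(prosody_prompt) / len(prosody_prompt) if prosody_prompt else 0.0
    return [0.0] * n_frames, [mean_f0] * n_frames


def synthesize(vec: List[float], f0: List[float], voice_prompt: Waveform) -> Waveform:
    """Stage 2 stub: produce a 16 kHz waveform from vector, F0, and voice prompt.

    The stub ignores the voice prompt and emits silence of the matching length.
    """
    assert len(vec) == len(f0), "vector and F0 must be frame-aligned"
    return Waveform([0.0] * (len(vec) * HOP), SR_LOW)


def super_resolve(wave: Waveform) -> Waveform:
    """Stage 3 stub: 16 kHz -> 48 kHz by sample repetition (placeholder only)."""
    factor = SR_HIGH // wave.sample_rate
    upsampled = [s for s in wave.samples for _ in range(factor)]
    return Waveform(upsampled, SR_HIGH)


def zero_shot_tts(text: str, prosody_prompt: List[float], voice_prompt: Waveform) -> Waveform:
    """Chain the three stages in the order the abstract describes."""
    vec, f0 = text_to_vec(text, prosody_prompt)
    wave_16k = synthesize(vec, f0, voice_prompt)
    return super_resolve(wave_16k)
```

In this sketch the prosody prompt affects only the first stage and the voice prompt only the second, which mirrors the separation the abstract describes. For voice conversion, the first stage would presumably be replaced by features extracted from source speech, although the abstract does not spell this out.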
