Pheme: Efficient and Conversational Speech Generation

Abstract

In recent years, speech generation has made remarkable progress and now achieves one-shot generation that is often virtually indistinguishable from a real human voice. Integrating these advances with large language models could revolutionize a wide range of applications. However, certain applications, such as assistive conversational systems, require speech generation tools that are natural and conversational and that also operate efficiently in real time. Current state-of-the-art models such as VALL-E and SoundStorm, powered by hierarchical neural audio codecs, require large neural components and extensive training data to work well. In contrast, MQTTS aims to build more compact conversational TTS models while capitalizing on smaller-scale real-life conversational speech data. However, its autoregressive nature yields high inference latency and thus limits its real-time usage. To mitigate the current limitations of state-of-the-art TTS models while capitalizing on their strengths, in this work we introduce the Pheme model series, which 1) offers compact yet high-performing models, 2) allows for parallel speech generation, 3) produces natural conversational speech, and 4) can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of autoregressive TTS models. We also show that simple teacher-student distillation on top of pretrained Pheme checkpoints yields significant improvements in voice quality for single-speaker setups, relying solely on synthetic speech generated by much larger teacher models. Audio samples and pretrained models are available online.
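
The latency contrast between autoregressive and parallel generation can be illustrated with a toy sketch. This is not the Pheme, MQTTS, or SoundStorm implementation: `toy_model` is a hypothetical stand-in that returns random tokens and confidences, and the codebook size, sequence length, and confidence-based cosine unmasking schedule are all assumptions chosen for illustration. The only point is to count how many sequential forward passes each decoding style needs.

```python
import math
import random

VOCAB_SIZE = 1024  # hypothetical codebook size (assumption)
MASK = -1          # sentinel for a position that has not been generated yet


def toy_model(tokens, rng):
    """Stand-in for one network forward pass: a (token, confidence) pair per position."""
    return [(rng.randrange(VOCAB_SIZE), rng.random()) for _ in tokens]


def decode_autoregressive(length, rng):
    """Fill positions left to right, one forward pass per generated token."""
    tokens, passes = [MASK] * length, 0
    for i in range(length):
        preds = toy_model(tokens, rng)
        passes += 1
        tokens[i] = preds[i][0]
    return tokens, passes


def decode_parallel(length, num_passes, rng):
    """Fill the most confident masked positions in each pass (cosine schedule)."""
    tokens, passes = [MASK] * length, 0
    for step in range(1, num_passes + 1):
        preds = toy_model(tokens, rng)
        passes += 1
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # How many positions should still be masked after this pass.
        if step < num_passes:
            keep_masked = int(length * math.cos(math.pi / 2 * step / num_passes))
        else:
            keep_masked = 0  # the final pass fills everything that is left
        n_fill = max(1, len(masked) - keep_masked)
        by_confidence = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in by_confidence[:n_fill]:
            tokens[i] = preds[i][0]
        if MASK not in tokens:
            break
    return tokens, passes


if __name__ == "__main__":
    rng = random.Random(0)
    _, ar_passes = decode_autoregressive(750, rng)   # 750 tokens: hypothetical length
    _, par_passes = decode_parallel(750, 8, rng)     # 8 passes: hypothetical budget
    print("autoregressive passes:", ar_passes)
    print("parallel passes:", par_passes)
```

Under these assumptions, the autoregressive loop needs one pass per token (750 here), while the parallel loop needs only the fixed pass budget (8 here) regardless of sequence length. Real latency also depends on model size and hardware, which this sketch ignores.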
