RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

Abstract

We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. Previous work based on large language models (LLMs) shows impressive zero-shot TTS performance, but such methods often suffer from poor robustness, such as unstable prosody (unnatural pitch and rhythm/duration) and a high word error rate (WER), because of the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to improve the robustness of LLM-based TTS. To realize this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E uses the predicted duration prompt to guide the computation of self-attention weights in the Transformer, forcing the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Comprehensive objective and subjective evaluations show that, compared with the strong baseline VALL-E, RALL-E significantly reduces the WER of zero-shot TTS from 6.3% (without reranking) and 2.1% (with reranking) to 2.8% and 1.0%, respectively. Furthermore, RALL-E correctly synthesizes sentences that are hard for VALL-E, reducing the error rate from 68% to 4%.
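
The duration-guided attention idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes each phoneme has an integer duration in speech frames, and that a speech frame may attend only to phonemes within a small window around the phoneme it is aligned to. The function name `duration_guided_mask` and the `window` parameter are hypothetical and do not appear in the abstract.

```python
def duration_guided_mask(durations, window=1):
    """Build a boolean mask of shape [num_frames][num_phonemes].

    durations[i] is the number of speech frames assigned to phoneme i.
    mask[t][i] is True when frame t may attend to phoneme i, i.e. when
    phoneme i lies within `window` positions of the phoneme that frame t
    is aligned to. `window` is an assumed parameter for this sketch.
    """
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    num_phonemes = len(durations)
    # Expand durations into a frame -> phoneme index alignment.
    frame_to_phoneme = [i for i, d in enumerate(durations) for _ in range(d)]
    mask = []
    for center in frame_to_phoneme:
        lo = max(0, center - window)
        hi = min(num_phonemes - 1, center + window)
        mask.append([lo <= i <= hi for i in range(num_phonemes)])
    return mask


durations = [2, 1, 3]  # toy durations (in frames) for three phonemes
mask = duration_guided_mask(durations, window=1)
for row in mask:
    # One row per speech frame; "x" marks phonemes the frame may attend to.
    print("".join("x" if allowed else "." for allowed in row))
```

In a Transformer, a mask of this kind would typically be applied to the attention scores between speech-token positions and phoneme positions before the softmax. How RALL-E constructs and applies its mask in detail is not specified in the abstract.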
