RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

Abstract

We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. Previous work based on large language models (LLMs) shows impressive zero-shot TTS performance, but such methods often suffer from poor robustness, including unstable prosody (unnatural pitch and rhythm/duration) and a high word error rate (WER), owing to the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to improve the robustness of LLM-based TTS. To realize this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E uses the predicted duration prompt to guide the computation of self-attention weights in the Transformer, forcing the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Comprehensive objective and subjective evaluations show that, compared with the strong baseline VALL-E, RALL-E significantly improves the WER of zero-shot TTS from 6.3% to 2.8% without reranking and from 2.1% to 1.0% with reranking. Furthermore, RALL-E correctly synthesizes sentences that are hard for VALL-E, reducing the error rate on them from 68% to 4%.
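The duration-guided attention step can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the masking rule, so the code assumes that each speech frame may attend only to the phoneme it is aligned with plus a fixed number of neighbours on each side. The function names, the `window` parameter, and the example durations are all hypothetical.

```python
from typing import List


def frame_to_phoneme(durations: List[int]) -> List[int]:
    """Expand per-phoneme durations (in frames) into a frame -> phoneme index map."""
    alignment: List[int] = []
    for phoneme_index, num_frames in enumerate(durations):
        alignment.extend([phoneme_index] * num_frames)
    return alignment


def duration_guided_mask(durations: List[int], window: int = 1) -> List[List[bool]]:
    """Return mask[t][p] == True if speech frame t may attend to phoneme p.

    Assumption (not stated in the source): a frame sees the phoneme it is
    aligned to and `window` neighbouring phonemes on each side.
    """
    num_phonemes = len(durations)
    mask: List[List[bool]] = []
    for center in frame_to_phoneme(durations):
        lo = max(0, center - window)
        hi = min(num_phonemes - 1, center + window)
        mask.append([lo <= p <= hi for p in range(num_phonemes)])
    return mask


if __name__ == "__main__":
    # Hypothetical predicted durations for four phonemes, in codec frames.
    for row in duration_guided_mask([2, 1, 3, 1], window=1):
        print("".join("x" if allowed else "." for allowed in row))
```

In a real model, a mask of this kind would typically be applied to the attention scores before the softmax, so that disallowed positions receive zero weight. Restricting attention in this way is one plausible reading of how predicted durations could keep the model from skipping or repeating phonemes.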
