
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

April 4, 2024
Authors: Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao
cs.AI

Abstract

We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. Although previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), because of the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to make LLM-based TTS more robust. To realize this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions for predicting speech tokens in a CoT style. Second, RALL-E uses the predicted duration prompt to guide the computation of self-attention weights in the Transformer, forcing the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Comprehensive objective and subjective evaluations show that, compared to the strong baseline VALL-E, RALL-E significantly improves the WER of zero-shot TTS from 6.3% (without reranking) and 2.1% (with reranking) to 2.8% and 1.0%, respectively. Furthermore, RALL-E correctly synthesizes sentences that are hard for VALL-E, reducing the error rate from 68% to 4%.
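
The abstract does not say how the duration prompt shapes the self-attention weights. As a rough illustration only, the sketch below builds a boolean mask in which each speech frame may attend to the phoneme it is aligned with (according to per-phoneme durations) plus a few neighbours. The function name `duration_guided_mask` and the `window` parameter are hypothetical and not taken from the paper; the actual masking strategy in RALL-E may differ.

```python
from typing import List


def duration_guided_mask(durations: List[int], window: int = 1) -> List[List[bool]]:
    """Build a frame-to-phoneme attention mask from per-phoneme durations.

    durations[p] is the number of speech frames assigned to phoneme p.
    mask[t][q] is True when speech frame t is allowed to attend to phoneme q.
    `window` (an assumption for this sketch) is how many neighbouring
    phonemes on each side of the aligned phoneme remain visible.
    """
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    num_phonemes = len(durations)
    mask: List[List[bool]] = []
    for p, d in enumerate(durations):
        lo = max(0, p - window)
        hi = min(num_phonemes, p + window + 1)
        row = [lo <= q < hi for q in range(num_phonemes)]
        # Every frame aligned to phoneme p shares the same visibility row.
        for _ in range(d):
            mask.append(list(row))
    return mask


if __name__ == "__main__":
    # Three phonemes lasting 2, 1 and 3 frames; window=0 keeps only the
    # aligned phoneme visible to each frame.
    for row in duration_guided_mask([2, 1, 3], window=0):
        print("".join("x" if visible else "." for visible in row))
```

In a Transformer, a mask like this would typically be applied by setting the attention logits of disallowed positions to a large negative value before the softmax, so that each speech token concentrates on the phonemes it is supposed to realize.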

