RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

April 4, 2024
Authors: Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao
cs.AI

Abstract

We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (unnatural pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To realize this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E uses the predicted duration prompt to guide the computation of self-attention weights in the Transformer, forcing the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Comprehensive objective and subjective evaluations show that, compared to the strong baseline VALL-E, RALL-E significantly improves the WER of zero-shot TTS from 6.3% (without reranking) and 2.1% (with reranking) to 2.8% and 1.0%, respectively. Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E, reducing the error rate from 68% to 4%.
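
The duration-guided attention described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `duration_guided_mask`, the `window` parameter, and the choice to keep a fixed number of neighboring phonemes visible are all assumptions made for illustration. The sketch expands per-phoneme durations into a frame-to-phoneme alignment, then lets each speech frame attend only to its aligned phoneme and nearby phonemes.

```python
from typing import List


def duration_guided_mask(durations: List[int], window: int = 1) -> List[List[bool]]:
    """Return mask[t][p] = True if speech frame t may attend to phoneme p.

    durations[p] is the (predicted) number of speech frames for phoneme p.
    window is a hypothetical parameter: how many neighboring phonemes on
    each side of the aligned phoneme stay visible.
    """
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    num_phonemes = len(durations)
    # Expand durations into a frame -> phoneme alignment, e.g. [2, 1] -> [0, 0, 1].
    frame_to_phoneme = [p for p, d in enumerate(durations) for _ in range(d)]
    mask = []
    for center in frame_to_phoneme:
        lo = max(0, center - window)
        hi = min(num_phonemes - 1, center + window)
        # Only phonemes inside [lo, hi] are visible to this frame.
        mask.append([lo <= p <= hi for p in range(num_phonemes)])
    return mask


# Three phonemes lasting 2, 1, and 3 frames; one row is printed per speech frame.
mask = duration_guided_mask([2, 1, 3], window=1)
for row in mask:
    print("".join("1" if visible else "0" for visible in row))
```

In a Transformer, a mask of this shape would typically be applied to the attention logits (masked positions set to a large negative value before the softmax), so each speech token is steered toward the phonemes it is supposed to realize.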
