

Boosting Large Language Model for Speech Synthesis: An Empirical Study

December 30, 2023
Authors: Hongkun Hao, Long Zhou, Shujie Liu, Jinyu Li, Shujie Hu, Rui Wang, Furu Wei
cs.AI

Abstract

Large language models (LLMs) have made significant advances in natural language processing and are concurrently extending their language ability to other modalities, such as speech and vision. Nevertheless, most previous work focuses on prompting LLMs with perception abilities like auditory comprehension, and an effective approach for augmenting LLMs with speech synthesis capabilities remains unclear. In this paper, we conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech by combining pre-trained LLMs (LLaMA/OPT) with the text-to-speech synthesis model VALL-E. We compare three integration methods between LLMs and speech synthesis models: directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E that use the LLM as a powerful text encoder. Experimental results show that directly fine-tuning LLMs with the LoRA method to boost speech synthesis capability does not work well, whereas superposing LLMs and VALL-E improves the quality of the generated speech in both speaker similarity and word error rate (WER). Among the three methods, the coupled approach leveraging the LLM as a text encoder achieves the best performance, outperforming the original speech synthesis model with consistently better speaker similarity and a significant (10.9%) WER reduction.
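The best-performing "coupled" integration can be pictured as a frozen LLM serving as the text encoder, with its hidden states projected into a VALL-E-style decoder that autoregressively predicts discrete audio codec tokens. The PyTorch sketch below is purely illustrative and not the authors' implementation: it substitutes a tiny untrained transformer for the pre-trained LLM, and all module names, dimensions, and the codec vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

class CoupledTTS(nn.Module):
    """Conceptual sketch of the 'coupled' method: an LLM acts as the text
    encoder; a VALL-E-style decoder predicts discrete codec tokens.
    All names and sizes here are illustrative, not the paper's."""
    def __init__(self, llm_dim=512, dec_dim=256, n_codec_tokens=1024):
        super().__init__()
        # Stand-in for a pre-trained LLM (e.g. LLaMA/OPT); kept frozen.
        self.llm_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8,
                                       batch_first=True),
            num_layers=2,
        )
        for p in self.llm_encoder.parameters():
            p.requires_grad = False  # freeze the LLM; train only the rest
        # Bridge the LLM hidden size to the speech decoder's width.
        self.proj = nn.Linear(llm_dim, dec_dim)
        # Simplified autoregressive decoder over audio codec tokens.
        self.codec_embed = nn.Embedding(n_codec_tokens, dec_dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dec_dim, nhead=4,
                                       batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dec_dim, n_codec_tokens)

    def forward(self, text_emb, codec_tokens):
        # text_emb: (B, T_text, llm_dim) token embeddings of the input text
        memory = self.proj(self.llm_encoder(text_emb))
        tgt = self.codec_embed(codec_tokens)        # (B, T_audio, dec_dim)
        out = self.decoder(tgt, memory)             # cross-attend to text
        return self.head(out)                       # logits over codec vocab

model = CoupledTTS()
logits = model(torch.randn(2, 7, 512), torch.randint(0, 1024, (2, 11)))
print(logits.shape)  # torch.Size([2, 11, 1024])
```

Only the projection, decoder, and output head carry gradients here, matching the idea that the LLM contributes its text-understanding ability without being fine-tuned itself.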