Boosting Large Language Model for Speech Synthesis: An Empirical Study
December 30, 2023
Authors: Hongkun Hao, Long Zhou, Shujie Liu, Jinyu Li, Shujie Hu, Rui Wang, Furu Wei
cs.AI
Abstract
Large language models (LLMs) have made significant advancements in natural
language processing and are concurrently extending their language ability to
other modalities, such as speech and vision. Nevertheless, most previous work
focuses on prompting LLMs with perception abilities such as auditory
comprehension, and an effective approach for augmenting LLMs with speech
synthesis capabilities remains unclear. In this paper, we conduct a
comprehensive empirical exploration of boosting LLMs with the ability to
generate speech, by combining the pre-trained LLMs LLaMA/OPT with the
text-to-speech synthesis model VALL-E. We compare three integration methods
between LLMs and speech synthesis models: directly fine-tuned LLMs, superposed
layers of LLMs and VALL-E, and coupled LLMs and VALL-E using the LLM as a
powerful text encoder. Experimental results show that directly fine-tuning
LLMs with the LoRA method to boost speech synthesis capability does not work
well, whereas superposing LLMs and VALL-E improves the quality of generated
speech in both speaker similarity and word error rate (WER). Among the three
methods, the coupled method leveraging the LLM as a text encoder achieves the
best performance, outperforming the original speech synthesis model with
consistently better speaker similarity and a significant (10.9%) WER reduction.
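The best-performing "coupled" integration described above can be pictured as a simple pipeline: a frozen LLM encodes the input text, and its hidden states condition a VALL-E-style decoder that predicts discrete audio-codec tokens. The sketch below is a pure-Python toy illustrating only this data flow; all class names, dimensions, and the stand-in arithmetic are hypothetical and do not come from the paper's actual implementation.

```python
# Hypothetical sketch of the "coupled" integration: a frozen LLM serves as the
# text encoder, and its hidden states condition a VALL-E-style decoder that
# predicts discrete audio codec tokens. Toy stand-ins, no real model weights.

class FrozenLLMEncoder:
    """Stand-in for LLaMA/OPT used purely as a text encoder."""

    def __init__(self, hidden_size=8):
        self.hidden_size = hidden_size

    def encode(self, token_ids):
        # Return one pseudo hidden vector per input text token.
        return [[(t + d) % 7 / 7.0 for d in range(self.hidden_size)]
                for t in token_ids]


class CodecDecoder:
    """Stand-in for a VALL-E-style decoder over neural-codec tokens."""

    def __init__(self, codebook_size=1024):
        self.codebook_size = codebook_size

    def decode(self, text_states, n_frames):
        # Toy rule: each frame's codec token depends on the pooled text
        # representation and the frame index (real decoding is autoregressive).
        pooled = sum(sum(vec) for vec in text_states)
        return [int(pooled * 100 + i) % self.codebook_size
                for i in range(n_frames)]


def synthesize(token_ids, n_frames=5):
    """Text token ids -> LLM hidden states -> codec tokens (one per frame)."""
    states = FrozenLLMEncoder().encode(token_ids)
    return CodecDecoder().decode(states, n_frames)


codec_tokens = synthesize([3, 14, 15], n_frames=4)
print(len(codec_tokens))  # one codec token per audio frame
```

In the real system the codec tokens would then be converted back to a waveform by a neural codec's decoder; the point of the sketch is only that the LLM contributes text representations while speech generation stays with the dedicated synthesis model.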