
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

February 6, 2025
作者: Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi DAI, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue
cs.AI

Abstract

Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., a diffusion model after the LLM), which complicates the decision of whether to scale a particular model during training or testing. This work makes the following contributions. First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose Llasa, a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we publicly release the checkpoints and training code for our TTS models (1B, 3B, 8B) and codec model.
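The inference-time scaling described above can be illustrated with a best-of-N search: sample several candidate speech-token sequences from the TTS model, score each with a verifier (a speech understanding model), and keep the highest-scoring one. The sketch below is a minimal, hypothetical illustration of that control flow; the function names and scoring are stand-ins, not Llasa's actual API.

```python
import random

def sample_candidates(text: str, n: int) -> list[str]:
    """Stand-in for drawing n speech-token sequences from the TTS LLM.
    (A real implementation would decode n sampled outputs from the model.)"""
    return [f"speech_tokens_{i}" for i in range(n)]

def verifier_score(text: str, candidate: str) -> float:
    """Stand-in for a speech-understanding verifier scoring a candidate,
    e.g., on content accuracy, emotion, or timbre consistency."""
    return random.random()

def best_of_n(text: str, n: int = 16) -> str:
    """Best-of-N search: more candidates (more inference-time compute)
    shifts the selected output toward the verifier's preferences."""
    candidates = sample_candidates(text, n)
    return max(candidates, key=lambda c: verifier_score(text, c))

print(best_of_n("Hello world", n=4))
```

Increasing `n` spends more inference-time compute and, with a reliable verifier, improves the chance that the selected sample matches the verifier's preferred qualities.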

