LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
June 5, 2024
Authors: Trung Dang, David Aponte, Dung Tran, Kazuhito Koishida
cs.AI
Abstract
Prior works have demonstrated zero-shot text-to-speech by using a generative
language model on audio tokens obtained via a neural audio codec. It is still
challenging, however, to adapt them to low-latency scenarios. In this paper, we
present LiveSpeech - a fully autoregressive language model-based approach for
zero-shot text-to-speech, enabling low-latency streaming of the output audio.
To allow multiple token prediction within a single decoding step, we propose
(1) using adaptive codebook loss weights that consider codebook contribution in
each frame and focus on hard instances, and (2) grouping codebooks and
processing groups in parallel. Experiments show our proposed models achieve
competitive results to state-of-the-art baselines in terms of content accuracy,
speaker similarity, audio quality, and inference speed while being suitable for
low-latency streaming applications.
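The abstract names two techniques for predicting multiple tokens per decoding step without spelling out their details: adaptive codebook loss weights that emphasize hard instances, and splitting the codebooks into groups that are processed in parallel. Below is a minimal PyTorch sketch of how such components could look. The module and function names (`GroupedCodebookHead`, `adaptive_codebook_loss`), the focal-style weighting, and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedCodebookHead(nn.Module):
    """Sketch: split the codebooks into groups and give each group its own
    linear head, so all codebook tokens for a frame are predicted in parallel
    within a single autoregressive decoding step. Sizes are assumptions."""

    def __init__(self, d_model=1024, n_codebooks=8, n_groups=2, vocab_size=1024):
        super().__init__()
        assert n_codebooks % n_groups == 0
        self.per_group = n_codebooks // n_groups
        self.vocab_size = vocab_size
        self.heads = nn.ModuleList(
            nn.Linear(d_model, self.per_group * vocab_size) for _ in range(n_groups)
        )

    def forward(self, h):
        # h: (batch, time, d_model) hidden states from the autoregressive decoder
        logits = [
            head(h).unflatten(-1, (self.per_group, self.vocab_size))
            for head in self.heads
        ]
        # (batch, time, n_codebooks, vocab_size)
        return torch.cat(logits, dim=-2)


def adaptive_codebook_loss(logits, targets, gamma=2.0):
    """Hypothetical adaptive codebook loss: weight each codebook's per-frame
    cross-entropy by how poorly it is predicted (a focal-style factor), so
    training focuses on hard instances. LiveSpeech's exact weighting may differ."""
    # logits: (B, T, C, V), targets: (B, T, C) integer token ids
    ce = F.cross_entropy(logits.permute(0, 3, 1, 2), targets, reduction="none")  # (B, T, C)
    p_correct = torch.exp(-ce)                        # probability of the target token
    weights = (1.0 - p_correct).pow(gamma)            # emphasize hard frames/codebooks
    weights = weights / weights.mean().clamp_min(1e-8)  # keep the overall loss scale stable
    return (weights.detach() * ce).mean()


# Usage sketch: 8 codebooks in 2 groups, a dummy batch of decoder states.
# head = GroupedCodebookHead()
# logits = head(torch.randn(2, 50, 1024))
# loss = adaptive_codebook_loss(logits, torch.randint(0, 1024, (2, 50, 8)))
```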