LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
June 5, 2024
Authors: Trung Dang, David Aponte, Dung Tran, Kazuhito Koishida
cs.AI
Abstract
Prior works have demonstrated zero-shot text-to-speech by using a generative
language model on audio tokens obtained via a neural audio codec. It is still
challenging, however, to adapt them to low-latency scenarios. In this paper, we
present LiveSpeech - a fully autoregressive language model-based approach for
zero-shot text-to-speech, enabling low-latency streaming of the output audio.
To allow multiple token prediction within a single decoding step, we propose
(1) using adaptive codebook loss weights that consider codebook contribution in
each frame and focus on hard instances, and (2) grouping codebooks and
processing groups in parallel. Experiments show our proposed models achieve
competitive results to state-of-the-art baselines in terms of content accuracy,
speaker similarity, audio quality, and inference speed while being suitable for
low-latency streaming applications.
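As a rough illustration of the two proposals in the abstract, the sketch below (not the authors' code) shows how Q codebooks might be split into G groups predicted by parallel output heads within a single decoding step, and how a per-codebook cross-entropy could be re-weighted toward hard instances. All names and shapes here (GroupedCodeHead, Q, G, the focal-style gamma weighting) are assumptions for illustration only; the paper defines its own adaptive weighting scheme.

```python
# Minimal sketch under assumed shapes; hypothetical names, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedCodeHead(nn.Module):
    """Predicts all codebooks of one group from the decoder hidden state."""

    def __init__(self, d_model: int, codebooks_per_group: int, vocab_size: int):
        super().__init__()
        self.codebooks_per_group = codebooks_per_group
        self.vocab_size = vocab_size
        self.proj = nn.Linear(d_model, codebooks_per_group * vocab_size)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, d_model) -> logits: (batch, time, codebooks, vocab)
        b, t, _ = h.shape
        return self.proj(h).view(b, t, self.codebooks_per_group, self.vocab_size)


def adaptive_codebook_loss(logits: torch.Tensor,
                           targets: torch.Tensor,
                           gamma: float = 2.0) -> torch.Tensor:
    """Cross-entropy over codebooks, re-weighted toward hard instances.

    logits: (batch, time, codebooks, vocab); targets: (batch, time, codebooks).
    The (1 - p_correct)**gamma factor is a focal-style stand-in for the paper's
    adaptive codebook loss weights, assumed here for illustration.
    """
    ce = F.cross_entropy(logits.flatten(0, 2), targets.flatten(), reduction="none")
    ce = ce.view(targets.shape)              # (batch, time, codebooks)
    p_correct = torch.exp(-ce).detach()      # model confidence on the true token
    weights = (1.0 - p_correct) ** gamma     # larger weight on hard frames/codebooks
    return (weights * ce).mean()


if __name__ == "__main__":
    B, T, D, Q, G, V = 2, 10, 256, 8, 2, 1024      # assumed sizes
    heads = nn.ModuleList(GroupedCodeHead(D, Q // G, V) for _ in range(G))
    h = torch.randn(B, T, D)                        # decoder hidden states
    # Groups are handled by parallel heads within one decoding step.
    logits = torch.cat([head(h) for head in heads], dim=2)   # (B, T, Q, V)
    targets = torch.randint(0, V, (B, T, Q))
    print(adaptive_codebook_loss(logits, targets).item())
```

The grouping reduces the number of sequential predictions per frame, which is what makes multiple-token prediction per decoding step, and hence low-latency streaming, plausible in this setup.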