SoundStorm: Efficient Parallel Audio Generation

Abstract

We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.
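
To make "confidence-based parallel decoding" concrete, the following is a minimal sketch of one common form of the idea, not the authors' implementation. It assumes a cosine unmasking schedule, which the abstract does not specify, and uses a hypothetical `toy_predict` function that returns random probabilities in place of the bidirectional model. The names `MASK`, `toy_predict`, and `parallel_decode` are all illustrative.

```python
import math
import numpy as np

MASK = -1  # hypothetical sentinel meaning "token not decoded yet"


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for a bidirectional model (assumption): random per-position probabilities."""
    logits = rng.normal(size=(len(tokens), vocab_size))
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)


def parallel_decode(length, vocab_size, num_steps, seed=0):
    """Fill all positions in at most num_steps rounds, most confident positions first."""
    rng = np.random.default_rng(seed)
    tokens = np.full(length, MASK, dtype=np.int64)
    for step in range(num_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        # Predict every position in parallel, then look only at the masked ones.
        probs = toy_predict(tokens, vocab_size, rng)
        candidates = probs[masked].argmax(axis=1)
        confidence = probs[masked].max(axis=1)
        # Assumed cosine schedule: fraction of positions left masked after this step.
        frac_masked = math.cos(math.pi / 2 * (step + 1) / num_steps)
        keep_masked = int(math.floor(frac_masked * length))
        num_to_fill = max(1, masked.size - keep_masked)
        # Commit the most confident predictions; the rest stay masked for later rounds.
        order = np.argsort(-confidence)[:num_to_fill]
        tokens[masked[order]] = candidates[order]
    return tokens


decoded = parallel_decode(length=32, vocab_size=16, num_steps=8)
```

The point of the sketch is the cost profile: the number of model calls is bounded by `num_steps` rather than by the sequence length, which is where a non-autoregressive decoder gets its speedup over token-by-token generation.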
