SoundStorm: Efficient Parallel Audio Generation
May 16, 2023
Authors: Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi
cs.AI
Abstract
We present SoundStorm, a model for efficient, non-autoregressive audio
generation. SoundStorm receives as input the semantic tokens of AudioLM, and
relies on bidirectional attention and confidence-based parallel decoding to
generate the tokens of a neural audio codec. Compared to the autoregressive
generation approach of AudioLM, our model produces audio of the same quality
and with higher consistency in voice and acoustic conditions, while being two
orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5
seconds on a TPU-v4. We demonstrate the ability of our model to scale audio
generation to longer sequences by synthesizing high-quality, natural dialogue
segments, given a transcript annotated with speaker turns and a short prompt
with the speakers' voices.
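
The confidence-based parallel decoding mentioned above follows the general MaskGIT-style iterative scheme: start from a fully masked sequence, sample all positions in parallel, commit the most confident predictions, and re-mask the rest on a shrinking schedule. The sketch below illustrates that loop only; it is not the exact SoundStorm procedure, and `dummy_predict`, the cosine schedule, and all sizes are illustrative assumptions.

```python
import numpy as np

def parallel_decode(predict_fn, seq_len, vocab_size, num_steps=8, rng=None):
    """Confidence-based parallel decoding (MaskGIT-style sketch, not the
    exact SoundStorm recipe). `predict_fn` maps the current token array
    (with -1 marking masked slots) to per-position logits."""
    rng = rng or np.random.default_rng(0)
    tokens = np.full(seq_len, -1, dtype=np.int64)  # -1 == MASK
    for step in range(num_steps):
        logits = predict_fn(tokens)                      # (seq_len, vocab_size)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        sampled = np.array([rng.choice(vocab_size, p=p) for p in probs])
        conf = probs[np.arange(seq_len), sampled]
        conf[tokens != -1] = np.inf                      # never re-mask committed tokens
        # cosine schedule: fraction of positions still masked after this step
        frac_masked = np.cos(np.pi / 2 * (step + 1) / num_steps)
        n_mask = int(frac_masked * seq_len)
        new_tokens = sampled.copy()
        new_tokens[tokens != -1] = tokens[tokens != -1]  # keep committed tokens
        if n_mask > 0:
            remask = np.argsort(conf)[:n_mask]           # lowest-confidence positions
            new_tokens[remask] = -1
        tokens = new_tokens
    return tokens

def dummy_predict(tokens):
    # Stand-in for a real bidirectional model: fixed random logits,
    # purely for demonstration.
    rng = np.random.default_rng(42)
    return rng.normal(size=(len(tokens), 4))

decoded = parallel_decode(dummy_predict, seq_len=16, vocab_size=4, num_steps=4)
```

Because every position is sampled in parallel at each of the few (`num_steps`) iterations, the number of forward passes is independent of sequence length, which is the source of the speedup over token-by-token autoregressive decoding.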