SoundStorm: Efficient Parallel Audio Generation
May 16, 2023
Authors: Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi
cs.AI
Abstract
We present SoundStorm, a model for efficient, non-autoregressive audio
generation. SoundStorm receives as input the semantic tokens of AudioLM, and
relies on bidirectional attention and confidence-based parallel decoding to
generate the tokens of a neural audio codec. Compared to the autoregressive
generation approach of AudioLM, our model produces audio of the same quality
and with higher consistency in voice and acoustic conditions, while being two
orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5
seconds on a TPU-v4. We demonstrate the ability of our model to scale audio
generation to longer sequences by synthesizing high-quality, natural dialogue
segments, given a transcript annotated with speaker turns and a short prompt
with the speakers' voices.
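
The confidence-based parallel decoding mentioned above follows the general MaskGIT-style iterative scheme: start from a fully masked sequence, sample all positions in parallel, commit the most confident predictions, and re-mask the rest on a shrinking schedule. The sketch below illustrates that loop only; it is not the exact SoundStorm procedure, and `dummy_predict`, the cosine schedule, and all sizes are illustrative assumptions.

```python
import numpy as np

def parallel_decode(predict_fn, seq_len, vocab_size, num_steps=8, rng=None):
    """Confidence-based parallel decoding (MaskGIT-style sketch, not the
    exact SoundStorm recipe). `predict_fn` maps the current token array
    (with -1 marking masked slots) to per-position logits."""
    rng = rng or np.random.default_rng(0)
    tokens = np.full(seq_len, -1, dtype=np.int64)  # -1 == MASK
    for step in range(num_steps):
        logits = predict_fn(tokens)                      # (seq_len, vocab_size)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        sampled = np.array([rng.choice(vocab_size, p=p) for p in probs])
        conf = probs[np.arange(seq_len), sampled]
        conf[tokens != -1] = np.inf                      # never re-mask committed tokens
        # cosine schedule: fraction of positions still masked after this step
        frac_masked = np.cos(np.pi / 2 * (step + 1) / num_steps)
        n_mask = int(frac_masked * seq_len)
        new_tokens = sampled.copy()
        new_tokens[tokens != -1] = tokens[tokens != -1]  # keep committed tokens
        if n_mask > 0:
            remask = np.argsort(conf)[:n_mask]           # lowest-confidence positions
            new_tokens[remask] = -1
        tokens = new_tokens
    return tokens

def dummy_predict(tokens):
    # Stand-in for a real bidirectional model: fixed random logits,
    # purely for demonstration.
    rng = np.random.default_rng(42)
    return rng.normal(size=(len(tokens), 4))

decoded = parallel_decode(dummy_predict, seq_len=16, vocab_size=4, num_steps=4)
```

Because every position is sampled in parallel at each of the few (`num_steps`) iterations, the number of forward passes is independent of sequence length, which is the source of the speedup over token-by-token autoregressive decoding.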