SoundStorm: Efficient Parallel Audio Generation

Abstract

We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.
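
To make "confidence-based parallel decoding" more concrete, the following is a minimal Python sketch of the general idea, not the authors' implementation. It starts from a fully masked token sequence and, at each step, predicts all positions at once and keeps only the most confident predictions. Several things here are assumptions that the abstract does not specify: the cosine unmasking schedule, the greedy (argmax) choice of candidates, the `MASK` sentinel, and the function names. `toy_model` is a hypothetical stand-in that returns random logits; a real model would be a bidirectional-attention network conditioned on the semantic tokens, and the codec's token structure is flattened here to a single sequence for simplicity.

```python
import numpy as np

MASK = -1  # hypothetical sentinel marking a position that is not decoded yet


def toy_model(tokens, vocab_size, rng):
    """Hypothetical stand-in for the bidirectional network.

    Returns random logits of shape (seq_len, vocab_size). A real model would
    attend over all positions and condition on the semantic tokens.
    """
    return rng.normal(size=(tokens.shape[0], vocab_size))


def parallel_decode(seq_len, vocab_size, num_steps, seed=0):
    """Fill a fully masked sequence in at most `num_steps` parallel passes."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK, dtype=np.int64)

    for step in range(num_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break

        # Predict every position in one pass and turn logits into probabilities.
        logits = toy_model(tokens, vocab_size, rng)
        logits = logits - logits.max(axis=-1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=-1, keepdims=True)

        # Greedy candidate per position; its probability is the confidence.
        candidates = probs.argmax(axis=-1)
        confidence = probs.max(axis=-1)

        # Assumed cosine schedule: fraction of positions left masked after this step.
        frac_masked = np.cos(np.pi / 2 * (step + 1) / num_steps)
        target_masked = int(np.floor(frac_masked * seq_len))
        num_to_fill = max(1, masked.size - target_masked)

        # Commit only the most confident masked positions; the rest stay masked
        # and are predicted again in the next pass.
        order = masked[np.argsort(-confidence[masked])]
        chosen = order[:num_to_fill]
        tokens[chosen] = candidates[chosen]

    return tokens
```

For example, `parallel_decode(seq_len=32, vocab_size=16, num_steps=8)` fills all 32 positions in 8 model passes, whereas a token-by-token autoregressive decoder would need 32. That gap between the number of passes and the sequence length is the kind of saving that parallel decoding is meant to provide.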
