SoundStorm: Efficient Parallel Audio Generation

Abstract

We present SoundStorm, a model for efficient, non-autoregressive audio generation. SoundStorm receives as input the semantic tokens of AudioLM, and relies on bidirectional attention and confidence-based parallel decoding to generate the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We demonstrate the ability of our model to scale audio generation to longer sequences by synthesizing high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.
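
As a rough illustration of what confidence-based parallel decoding can look like, the sketch below starts from a fully masked token sequence and, at each iteration, commits the predictions the model is most confident about, leaving the rest masked for the next pass. It is a minimal sketch under stated assumptions, not SoundStorm's implementation: `toy_model` is a hypothetical stand-in that returns random probabilities in place of a bidirectional-attention network conditioned on semantic tokens, the cosine unmasking schedule and the greedy (argmax) token choice are assumptions borrowed from common masked-decoding practice, and the sketch decodes a single flat token sequence rather than the multiple token levels of a neural audio codec.

```python
import numpy as np

MASK = -1  # hypothetical sentinel marking a position that is not decoded yet


def toy_model(tokens, vocab_size, rng):
    """Hypothetical stand-in for a bidirectional network.

    Ignores its input and returns random per-position probabilities
    with shape (len(tokens), vocab_size).
    """
    logits = rng.normal(size=(len(tokens), vocab_size))
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def parallel_decode(length, vocab_size, num_steps, seed=0):
    """Fill `length` positions in at most `num_steps` parallel passes."""
    rng = np.random.default_rng(seed)
    tokens = np.full(length, MASK, dtype=np.int64)
    for step in range(num_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        # One forward pass scores every position at once.
        probs = toy_model(tokens, vocab_size, rng)
        candidates = probs[masked].argmax(axis=-1)  # greedy choice (assumption)
        confidence = probs[masked].max(axis=-1)
        # Assumed cosine schedule: how many positions stay masked after this pass.
        keep_masked = int(np.floor(length * np.cos(np.pi / 2 * (step + 1) / num_steps)))
        num_commit = max(1, masked.size - keep_masked)
        # Commit only the most confident predictions; the rest are retried.
        top = np.argsort(-confidence)[:num_commit]
        tokens[masked[top]] = candidates[top]
    return tokens


decoded = parallel_decode(length=16, vocab_size=8, num_steps=4)
print(decoded)
```

The number of forward passes is set by `num_steps` rather than by the sequence length, which is the general reason a parallel scheme of this kind can be much faster than token-by-token autoregressive generation.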
