VibeVoice Technical Report

Abstract

This report presents VibeVoice, a novel model that synthesizes long-form, multi-speaker speech using next-token diffusion, a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that improves data compression by 80 times over the popular Encodec model while maintaining comparable performance. The tokenizer preserves audio fidelity while significantly improving computational efficiency on long sequences. As a result, VibeVoice can synthesize up to 90 minutes of speech (within a 64K context window) with up to 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
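
To make the relationship between the 90-minute figure and the 64K context window concrete, the sketch below works out the token budget those two numbers imply. It is back-of-the-envelope arithmetic only, not the tokenizer's actual frame rate: it assumes "64K" means 65,536 tokens and that the whole window is spent on speech tokens, neither of which the abstract states. The function name `max_tokens_per_second` is hypothetical.

```python
def max_tokens_per_second(context_tokens: int, minutes: float) -> float:
    """Upper bound on tokens per second of audio if the entire
    context window were filled with speech tokens."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return context_tokens / (minutes * 60.0)


# Assumption: "64K" is read as 64 * 1024 = 65,536 tokens.
CONTEXT_TOKENS = 64 * 1024
# From the abstract: up to 90 minutes of synthesized speech.
MINUTES = 90

rate = max_tokens_per_second(CONTEXT_TOKENS, MINUTES)
print(f"At most {rate:.2f} tokens per second of audio")
```

Under these assumptions the budget comes to roughly 12 tokens per second of audio. This is why an aggressively compressed tokenizer matters: a representation that needs many more tokens per second could not fit 90 minutes into the same window.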
