VibeVoice Technical Report

Abstract

This report presents VibeVoice, a novel model designed to synthesize long-form, multi-speaker speech using next-token diffusion, a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, compared with the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer preserves audio fidelity while significantly improving computational efficiency on long sequences. As a result, VibeVoice can synthesize up to 90 minutes of speech with up to 4 speakers within a 64K context window, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
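
To make the relationship between the 90-minute and 64K figures concrete, the short calculation below estimates the average token rate that such a budget allows. It is a back-of-the-envelope sketch, not a description of the tokenizer: it assumes that "64K" means 64 × 1024 positions and that the whole context window is spent on speech latents, neither of which is stated above. The function name `max_tokens_per_second` is hypothetical.

```python
# Back-of-the-envelope token budget implied by the abstract's figures.
# Assumptions (not stated in the report): "64K" means 64 * 1024 positions,
# and the entire context window is spent on speech latents.

CONTEXT_TOKENS = 64 * 1024   # 64K context window
MAX_DURATION_S = 90 * 60     # 90 minutes of audio, in seconds


def max_tokens_per_second(context_tokens: int, duration_s: int) -> float:
    """Largest average token rate that still fits the audio in the window."""
    return context_tokens / duration_s


rate = max_tokens_per_second(CONTEXT_TOKENS, MAX_DURATION_S)
print(f"at most {rate:.2f} tokens per second of audio")
```

Under these assumptions the budget works out to roughly 12 tokens per second of audio. Any text, speaker prompts, or other tokens sharing the window would lower the rate available for speech, so this is an upper bound rather than the tokenizer's actual frame rate.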