VibeVoice Technical Report
August 26, 2025
Authors: Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei
cs.AI
Abstract
This report presents VibeVoice, a novel model designed to synthesize
long-form speech with multiple speakers by employing next-token diffusion,
which is a unified method for modeling continuous data by autoregressively
generating latent vectors via diffusion. To enable this, we introduce a novel
continuous speech tokenizer that, when compared to the popular Encodec model,
improves data compression by 80 times while maintaining comparable performance.
The tokenizer effectively preserves audio fidelity while significantly boosting
computational efficiency for processing long sequences. Thus, VibeVoice can
synthesize long-form speech for up to 90 minutes (within a 64K context window)
with a maximum of 4 speakers, capturing the authentic conversational "vibe"
and surpassing open-source and proprietary dialogue models.
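To make the abstract's core mechanism concrete, the minimal sketch below illustrates the idea of next-token diffusion: an autoregressive backbone encodes the sequence of continuous latents generated so far, and a small diffusion head denoises the next latent vector conditioned on the backbone's last hidden state. All module choices, dimensions, and the denoising schedule are illustrative assumptions for exposition, not the VibeVoice implementation.

```python
# Hypothetical sketch of next-token diffusion over continuous latents.
# An autoregressive backbone conditions a small diffusion head that
# iteratively denoises the next latent vector. Module types, sizes, and
# the DDPM-like schedule are assumptions, not the VibeVoice code.

import torch
import torch.nn as nn

LATENT_DIM, HIDDEN_DIM, STEPS = 64, 256, 50

class DiffusionHead(nn.Module):
    """Predicts the noise in a latent, conditioned on the AR hidden state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + HIDDEN_DIM + 1, HIDDEN_DIM),
            nn.SiLU(),
            nn.Linear(HIDDEN_DIM, LATENT_DIM),
        )

    def forward(self, noisy_latent, cond, t):
        # t is a scalar timestep in [0, 1], broadcast over the batch.
        t_embed = t.expand(noisy_latent.size(0), 1)
        return self.net(torch.cat([noisy_latent, cond, t_embed], dim=-1))

class NextTokenDiffusion(nn.Module):
    """Autoregressive backbone over continuous latents plus a diffusion head."""
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(LATENT_DIM, HIDDEN_DIM)
        self.backbone = nn.GRU(HIDDEN_DIM, HIDDEN_DIM, batch_first=True)
        self.head = DiffusionHead()

    @torch.no_grad()
    def generate(self, prompt_latents, num_new):
        """Autoregressively sample `num_new` latent vectors after the prompt."""
        latents = prompt_latents                      # (B, T, LATENT_DIM)
        for _ in range(num_new):
            h, _ = self.backbone(self.in_proj(latents))
            cond = h[:, -1, :]                        # hidden state at the last position
            # Iterative denoising from Gaussian noise (simplified schedule).
            x = torch.randn(latents.size(0), LATENT_DIM)
            for step in reversed(range(STEPS)):
                t = torch.tensor([[step / STEPS]])
                eps = self.head(x, cond, t)
                x = x - eps / STEPS                   # crude Euler-style update
            latents = torch.cat([latents, x.unsqueeze(1)], dim=1)
        return latents

model = NextTokenDiffusion()
out = model.generate(torch.randn(1, 8, LATENT_DIM), num_new=4)
print(out.shape)  # torch.Size([1, 12, 64])
```

For a rough sense of scale, fitting 90 minutes of speech (5,400 seconds) into a 64K-token context implies on the order of 64,000 / 5,400 ≈ 12 tokens per second of audio, which is the kind of rate a highly compressive continuous tokenizer enables; how the budget is split between text and acoustic latents is not specified in the abstract.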