LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
March 6, 2025
Authors: Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
cs.AI
Abstract
Recent advancements in speech-to-speech dialogue systems leverage LLMs for
multimodal interactions, yet they remain hindered by fine-tuning requirements,
high computational overhead, and text-speech misalignment. Existing
speech-enabled LLMs often degrade conversational quality by modifying the LLM,
thereby compromising its linguistic capabilities. In contrast, we propose
LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS
system that generates high-quality speech with low latency, while fully
preserving the capabilities of the base LLM. Our approach achieves a
significantly lower Word Error Rate compared to speech-enabled LLMs, while
operating at comparable latency and UTMOS score. By decoupling speech synthesis
from LLM processing via a multi-queue token streaming system, LLMVoX supports
seamless, infinite-length dialogues. Its plug-and-play design also facilitates
extension to various tasks with different backbones. Furthermore, LLMVoX
generalizes to new languages with only dataset adaptation, attaining a low
Character Error Rate on an Arabic speech task. Additionally, we have integrated
LLMVoX with a Vision-Language Model to create an omni-model with speech, text,
and vision capabilities, without requiring additional multimodal training. Our
code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX .
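To illustrate the decoupling idea described in the abstract, the following is a minimal sketch of how speech synthesis can be separated from LLM decoding via queued token streaming. It is not the authors' implementation: the names `llm_stream` and `tts_synthesize`, the two-queue layout, and the chunking threshold are all illustrative assumptions.

```python
import queue
import threading

# Hypothetical sketch: the LLM pushes text tokens into a queue while a separate
# TTS worker consumes them and emits audio chunks, so synthesis never blocks
# LLM decoding. llm_stream (an iterator of text tokens) and tts_synthesize
# (text -> audio chunk) are placeholder interfaces, not the paper's API.

def producer(llm_stream, text_queue):
    """Stream text tokens from the LLM into the text queue."""
    for token in llm_stream:
        text_queue.put(token)
    text_queue.put(None)  # sentinel: end of the LLM response

def consumer(text_queue, audio_queue, tts_synthesize):
    """Consume buffered text tokens and stream synthesized audio chunks."""
    buffer = []
    while True:
        token = text_queue.get()
        if token is None:
            break
        buffer.append(token)
        # Synthesize once a small chunk is buffered (threshold chosen
        # arbitrarily here), keeping latency low while the LLM keeps
        # generating in parallel.
        if len(buffer) >= 4:
            audio_queue.put(tts_synthesize("".join(buffer)))
            buffer = []
    if buffer:
        audio_queue.put(tts_synthesize("".join(buffer)))
    audio_queue.put(None)  # sentinel: end of audio stream

def run(llm_stream, tts_synthesize):
    """Yield audio chunks as they become available, independent of LLM speed."""
    text_queue, audio_queue = queue.Queue(), queue.Queue()
    threading.Thread(target=producer, args=(llm_stream, text_queue),
                     daemon=True).start()
    threading.Thread(target=consumer, args=(text_queue, audio_queue, tts_synthesize),
                     daemon=True).start()
    while (chunk := audio_queue.get()) is not None:
        yield chunk  # play or forward the audio chunk to the client
```

Because the LLM and the TTS module only communicate through queues, the base LLM runs unmodified, which mirrors the plug-and-play, LLM-agnostic property claimed in the abstract.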