LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
March 6, 2025
Authors: Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
cs.AI
Abstract
Recent advancements in speech-to-speech dialogue systems leverage LLMs for
multimodal interactions, yet they remain hindered by fine-tuning requirements,
high computational overhead, and text-speech misalignment. Existing
speech-enabled LLMs often degrade conversational quality by modifying the LLM,
thereby compromising its linguistic capabilities. In contrast, we propose
LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS
system that generates high-quality speech with low latency, while fully
preserving the capabilities of the base LLM. Our approach achieves a
significantly lower Word Error Rate compared to speech-enabled LLMs, while
operating at comparable latency and UTMOS score. By decoupling speech synthesis
from LLM processing via a multi-queue token streaming system, LLMVoX supports
seamless, infinite-length dialogues. Its plug-and-play design also facilitates
extension to various tasks with different backbones. Furthermore, LLMVoX
generalizes to new languages with only dataset adaptation, attaining a low
Character Error Rate on an Arabic speech task. Additionally, we have integrated
LLMVoX with a Vision-Language Model to create an omni-model with speech, text,
and vision capabilities, without requiring additional multimodal training. Our
code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX .
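To illustrate the decoupling idea described in the abstract, the following is a minimal sketch of how speech synthesis can be separated from LLM decoding via queued token streaming. It is not the authors' implementation: the names `llm_stream` and `tts_synthesize`, the two-queue layout, and the chunking threshold are all illustrative assumptions.

```python
import queue
import threading

# Hypothetical sketch: the LLM pushes text tokens into a queue while a separate
# TTS worker consumes them and emits audio chunks, so synthesis never blocks
# LLM decoding. llm_stream (an iterator of text tokens) and tts_synthesize
# (text -> audio chunk) are placeholder interfaces, not the paper's API.

def producer(llm_stream, text_queue):
    """Stream text tokens from the LLM into the text queue."""
    for token in llm_stream:
        text_queue.put(token)
    text_queue.put(None)  # sentinel: end of the LLM response

def consumer(text_queue, audio_queue, tts_synthesize):
    """Consume buffered text tokens and stream synthesized audio chunks."""
    buffer = []
    while True:
        token = text_queue.get()
        if token is None:
            break
        buffer.append(token)
        # Synthesize once a small chunk is buffered (threshold chosen
        # arbitrarily here), keeping latency low while the LLM keeps
        # generating in parallel.
        if len(buffer) >= 4:
            audio_queue.put(tts_synthesize("".join(buffer)))
            buffer = []
    if buffer:
        audio_queue.put(tts_synthesize("".join(buffer)))
    audio_queue.put(None)  # sentinel: end of audio stream

def run(llm_stream, tts_synthesize):
    """Yield audio chunks as they become available, independent of LLM speed."""
    text_queue, audio_queue = queue.Queue(), queue.Queue()
    threading.Thread(target=producer, args=(llm_stream, text_queue),
                     daemon=True).start()
    threading.Thread(target=consumer, args=(text_queue, audio_queue, tts_synthesize),
                     daemon=True).start()
    while (chunk := audio_queue.get()) is not None:
        yield chunk  # play or forward the audio chunk to the client
```

Because the LLM and the TTS module only communicate through queues, the base LLM runs unmodified, which mirrors the plug-and-play, LLM-agnostic property claimed in the abstract.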