LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

March 6, 2025
作者: Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
cs.AI

Abstract

Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our codebase and project page are available at https://mbzuai-oryx.github.io/LLMVoX.
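The abstract's central architectural idea is decoupling text generation from speech synthesis through token queues, so the base LLM keeps streaming text while a lightweight TTS worker consumes it in parallel. The sketch below is a minimal, hypothetical illustration of that producer-consumer pattern in Python, not the actual LLMVoX implementation; `generate_text_tokens`, `synthesize_chunk`, and the phrase-boundary flushing rule are placeholder assumptions introduced only for this example.

```python
import queue
import threading
import time

# Hypothetical sketch of the queue-based decoupling described in the abstract:
# an LLM producer streams text tokens into a queue while a TTS consumer drains
# it and synthesizes speech chunks, so neither component blocks the other.
# All functions below are stand-ins, not the LLMVoX API.

text_queue: queue.Queue = queue.Queue()
SENTINEL = None  # signals the end of the LLM's text stream


def generate_text_tokens(prompt: str):
    """Stand-in for streaming token generation by an unmodified base LLM."""
    # The prompt is ignored here; a fixed sentence simulates the token stream.
    for word in "The quick brown fox jumps over the lazy dog .".split():
        time.sleep(0.05)  # simulate per-token decoding latency
        yield word + " "


def synthesize_chunk(text: str) -> bytes:
    """Stand-in for the lightweight streaming TTS model."""
    return f"<audio for: {text.strip()}>".encode()


def llm_producer(prompt: str) -> None:
    """Push text tokens into the queue as soon as the LLM emits them."""
    for token in generate_text_tokens(prompt):
        text_queue.put(token)
    text_queue.put(SENTINEL)


def tts_consumer() -> None:
    """Drain the queue, flushing accumulated text to TTS at phrase boundaries."""
    buffer = []
    while True:
        token = text_queue.get()
        if token is SENTINEL:
            break
        buffer.append(token)
        if token.strip().endswith((".", ",", "!", "?")):
            print(synthesize_chunk("".join(buffer)))
            buffer.clear()
    if buffer:  # flush any trailing text after the stream ends
        print(synthesize_chunk("".join(buffer)))


threading.Thread(target=llm_producer, args=("Tell me about Mars.",)).start()
tts_consumer()
```

In the actual system, the paper describes a multi-queue token streaming setup, so presumably several such queues carry text and audio in parallel and playback overlaps with ongoing LLM decoding, which is what keeps end-to-end latency low while leaving the base LLM untouched.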
