LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

Abstract

Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate than speech-enabled LLMs while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX.
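For readers unfamiliar with the metric, Word Error Rate is conventionally the word-level edit distance between a reference transcript and the transcript of the generated speech, divided by the number of reference words. The following is a minimal sketch of that standard definition. The function name `word_error_rate` and the sample strings are illustrative assumptions, not part of the LLMVoX code base, and the authors' evaluation pipeline (ASR model, text normalization) may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substituted word out of five reference words.
    print(word_error_rate("the cat sat down quietly", "the cat sat down quickly"))
```

Character Error Rate follows the same recipe with characters in place of words, which is why it is the usual choice for languages such as Arabic where word boundaries and morphology make word-level scoring harsher.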
