LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

Abstract

Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS (Text-to-Speech) system that generates high-quality speech with low latency while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate than speech-enabled LLMs while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX.
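To make the idea of decoupling speech synthesis from LLM processing through multiple queues more concrete, the following is a minimal sketch, not the authors' implementation. It assumes sentence-level chunking, round-robin dispatch across workers, and threads as the concurrency mechanism; none of these details are stated in the abstract. The names `fake_llm_stream`, `fake_tts`, `producer`, `tts_worker`, and `run_pipeline` are hypothetical placeholders, and `fake_tts` returns a tagged string rather than audio.

```python
import queue
import threading

END = object()  # sentinel marking the end of a stream


def fake_llm_stream():
    """Hypothetical stand-in for an LLM that yields text tokens one by one."""
    text = "Hello there. This is a test. Streaming works!"
    for token in text.split(" "):
        yield token


def fake_tts(sentence):
    """Hypothetical stand-in for a TTS model; returns a tagged string, not audio."""
    return f"<audio:{sentence}>"


def producer(token_stream, text_queues):
    """Group tokens into sentences and hand them out round-robin (an assumption)."""
    buffer, index = [], 0
    for token in token_stream:
        buffer.append(token)
        if token.endswith((".", "!", "?")):
            text_queues[index % len(text_queues)].put(" ".join(buffer))
            buffer, index = [], index + 1
    if buffer:  # flush a trailing partial sentence
        text_queues[index % len(text_queues)].put(" ".join(buffer))
    for q in text_queues:
        q.put(END)


def tts_worker(text_queue, audio_queue):
    """Synthesize each queued sentence independently of the LLM thread."""
    while True:
        sentence = text_queue.get()
        if sentence is END:
            audio_queue.put(END)
            return
        audio_queue.put(fake_tts(sentence))


def run_pipeline(num_workers=2):
    """Run the producer and workers, then collect audio chunks in sentence order."""
    text_queues = [queue.Queue() for _ in range(num_workers)]
    audio_queues = [queue.Queue() for _ in range(num_workers)]
    threads = [threading.Thread(target=producer,
                                args=(fake_llm_stream(), text_queues))]
    threads += [threading.Thread(target=tts_worker, args=(t, a))
                for t, a in zip(text_queues, audio_queues)]
    for th in threads:
        th.start()

    # Read the audio queues in the same round-robin order used for dispatch,
    # so the chunks come back in the original sentence order.
    played, index = [], 0
    while True:
        chunk = audio_queues[index % num_workers].get()
        if chunk is END:  # the first END means no later sentence exists
            break
        played.append(chunk)
        index += 1
    for th in threads:
        th.join()
    return played


if __name__ == "__main__":
    print(run_pipeline())
```

Because the text generator only writes to queues and never waits on synthesis, the language model in this sketch is untouched by the speech side, which mirrors the LLM-agnostic property the abstract describes.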
