VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Abstract

As demand for natural human-computer interaction grows, speech-based systems are receiving increasing attention, since speech is one of the most common forms of daily communication. However, existing speech models still suffer from high latency when generating the first audio token during streaming, which is a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates inference but also significantly reduces the latency of generating the first audio in streaming scenarios. In addition, we explore a four-stage progressive training strategy to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3 to 5x at the 7B parameter scale but also significantly outperforms open-source models of similar size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
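
As a rough illustration of why emitting several tokens per forward pass can reduce the number of passes, the sketch below simply counts forward passes. It assumes each pass yields one token from the main model plus a fixed number of extra predicted tokens. The function name, the parameter names, and the example value of 4 extra tokens are hypothetical and are not taken from the paper. Actual speedups also depend on the cost of the prediction modules themselves.

```python
import math


def forward_passes(num_tokens: int, extra_tokens_per_pass: int) -> int:
    """Count forward passes needed to emit num_tokens.

    Assumption (not from the paper): every pass yields 1 token from the
    main model plus extra_tokens_per_pass tokens from prediction modules.
    """
    if num_tokens < 0 or extra_tokens_per_pass < 0:
        raise ValueError("arguments must be non-negative")
    tokens_per_pass = 1 + extra_tokens_per_pass
    return math.ceil(num_tokens / tokens_per_pass)


# Hypothetical comparison: plain one-token-per-pass decoding versus
# decoding with 4 extra tokens predicted in each pass.
baseline = forward_passes(100, 0)
with_extra = forward_passes(100, 4)
print(baseline, with_extra)
```

Under these assumptions the pass count shrinks by the factor of tokens emitted per pass, which is an upper bound on the wall-clock gain rather than a measured result.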
