VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Abstract

With the growing demand for natural human-computer interaction, speech-based systems are receiving increasing attention, as speech is one of the most common forms of daily communication. However, existing speech models still suffer from high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates inference but also significantly reduces the latency of generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3x to 5x at the 7B parameter scale, but also significantly outperforms open-source models of similar size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
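
The idea of emitting several tokens per forward pass can be illustrated with a toy decoding loop. This is only a minimal sketch under stated assumptions, not the VITA-Audio implementation: `toy_backbone`, `toy_predict`, and `extra_per_pass` are hypothetical stand-ins invented here, and the arithmetic inside them has no meaning beyond producing deterministic integers. The sketch only counts how many expensive backbone passes are needed when each pass is followed by a few cheap extra predictions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecodeStats:
    tokens: List[int]
    backbone_passes: int


def toy_backbone(context: List[int]) -> int:
    """Hypothetical stand-in for one expensive model forward pass.

    Returns a fake 'hidden state' derived from the context.
    """
    return sum(context) % 97


def toy_predict(hidden: int, step: int) -> int:
    """Hypothetical stand-in for token prediction from a hidden state.

    step 0 plays the role of the backbone's own next token; steps >= 1 play
    the role of extra tokens from a lightweight prediction module.
    """
    return (hidden * 31 + step) % 1024


def decode(prompt: List[int], num_tokens: int, extra_per_pass: int) -> DecodeStats:
    """Generate num_tokens tokens, emitting 1 + extra_per_pass tokens per backbone pass."""
    tokens: List[int] = []
    passes = 0
    while len(tokens) < num_tokens:
        hidden = toy_backbone(prompt + tokens)  # the expensive step
        passes += 1
        for step in range(1 + extra_per_pass):  # the cheap steps reuse `hidden`
            if len(tokens) == num_tokens:
                break
            tokens.append(toy_predict(hidden, step))
    return DecodeStats(tokens, passes)


if __name__ == "__main__":
    baseline = decode([1, 2, 3], num_tokens=20, extra_per_pass=0)
    accelerated = decode([1, 2, 3], num_tokens=20, extra_per_pass=4)
    print(baseline.backbone_passes, accelerated.backbone_passes)  # prints: 20 4
```

In this toy setting the number of backbone passes drops from 20 to 4, but the wall-clock gain of a real system depends on how cheap the extra prediction module is relative to the backbone, so the ratio here should not be read as the speedup reported above.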
