Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

Abstract

Large Language Models (LLMs) have revolutionized natural language processing, but applying them to speech-based tasks remains challenging because of the complexity of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Using a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving results comparable to cascaded systems. Notably, Ichigo's latency to first-token generation is just 111 ms, significantly lower than that of current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models.
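As a rough illustration of what "tokenized early fusion" means, the sketch below maps toy speech frames to discrete codebook indices and places them in the same ID space as text tokens, so that one sequence model could consume both. This is not the paper's implementation: the codebook, the vocabulary size, and the function names `quantize`, `to_shared_ids`, and `build_sequence` are all hypothetical, and a nearest-neighbour lookup stands in for whatever quantizer Ichigo actually uses.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]

# Hypothetical toy size of the text vocabulary; speech codes are placed after it.
TEXT_VOCAB_SIZE = 8


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector (squared Euclidean distance)."""
    def dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: dist(frame, codebook[k]))
        for frame in frames
    ]


def to_shared_ids(codes: Sequence[int], offset: int = TEXT_VOCAB_SIZE) -> List[int]:
    """Shift speech codes past the text vocabulary so both modalities share one ID space."""
    return [offset + c for c in codes]


def build_sequence(speech_ids: Sequence[int], text_ids: Sequence[int]) -> List[int]:
    """Concatenate speech and text token IDs into a single sequence for one model."""
    return list(speech_ids) + list(text_ids)


# Toy data (assumed for illustration only).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.8)]

codes = quantize(frames, codebook)
speech_ids = to_shared_ids(codes)
sequence = build_sequence(speech_ids, [3, 5])
print(sequence)
```

Because speech and text end up as IDs in one vocabulary, a single embedding table and transformer can handle both, which is the sense in which no separate adapter is needed.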
