LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Abstract

Models such as GPT-4o enable real-time speech interaction with large language models (LLMs), significantly improving the user experience over traditional text-based interaction. However, how to build speech interaction models on top of open-source LLMs remains underexplored. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency, high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which contains 200K speech instructions and corresponding speech responses. Experimental results show that, compared with previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226 ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.
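
As a rough illustration of the component flow named in the abstract (speech encoder, speech adaptor, LLM, streaming speech decoder), the following Python sketch wires four placeholder callables into a generator that yields a text token and its speech payload together. It is not the LLaMA-Omni implementation: `respond`, `ResponseChunk`, and every `toy_*` function are hypothetical names introduced here, and the toy components only mimic the shape of the data flow under simple assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence, Tuple


@dataclass
class ResponseChunk:
    text_token: str          # next piece of the text response
    speech_units: List[int]  # placeholder audio payload for that piece


def respond(
    waveform: Sequence[float],
    encoder: Callable[[Sequence[float]], List[float]],
    adaptor: Callable[[List[float]], List[float]],
    llm: Callable[[List[float]], Iterator[Tuple[str, float]]],
    speech_decoder: Callable[[float], List[int]],
) -> Iterator[ResponseChunk]:
    """Stream text and speech together; no transcript of the input is produced."""
    features = encoder(waveform)    # speech -> acoustic features
    embeddings = adaptor(features)  # features -> LLM input space
    for token, hidden in llm(embeddings):
        # Decode speech for each token as soon as the LLM emits it,
        # instead of waiting for the full text response.
        yield ResponseChunk(token, speech_decoder(hidden))


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_encoder(waveform):
    # Average non-overlapping windows of 4 samples.
    return [sum(waveform[i:i + 4]) / 4 for i in range(0, len(waveform) - 3, 4)]


def toy_adaptor(features):
    # Keep every second frame to shorten the sequence.
    return features[::2]


def toy_llm(embeddings):
    # Emit one canned token per input embedding, paired with a fake hidden state.
    for i, e in enumerate(embeddings):
        yield f"tok{i}", e


def toy_speech_decoder(hidden):
    # Map a fake hidden state to two fake discrete units.
    unit = int(abs(hidden) * 10) % 100
    return [unit, unit + 1]


if __name__ == "__main__":
    wave = [0.1 * (i % 8) for i in range(32)]
    for chunk in respond(wave, toy_encoder, toy_adaptor, toy_llm, toy_speech_decoder):
        print(chunk.text_token, chunk.speech_units)
```

Writing `respond` as a generator means the first chunk is available after a single LLM step, which is the property that keeps response latency low in a streaming design of this kind.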
