
LLaMA-Omni: Seamless Speech Interaction with Large Language Models

September 10, 2024
Authors: Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng
cs.AI

Abstract

Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription, and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.
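The abstract describes a pipeline of a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder that produces text and speech responses together. Below is a minimal sketch of how such components might be composed. All class names, tensor shapes, and interfaces (including the downsample-and-project adaptor and the module call signatures) are illustrative assumptions for clarity, not the paper's actual implementation.

```python
# Hypothetical sketch of a LLaMA-Omni-style speech-to-speech pipeline.
# Module names, shapes, and interfaces are assumptions, not the paper's code.
import torch
import torch.nn as nn


class SpeechAdaptor(nn.Module):
    """Maps speech-encoder features into the LLM embedding space (assumed design)."""

    def __init__(self, enc_dim: int, llm_dim: int, downsample: int = 5):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * downsample, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, enc_dim); concatenate every `downsample` frames
        # to shorten the sequence, then project into the LLM embedding space.
        b, t, d = feats.shape
        t = t - t % self.downsample
        feats = feats[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(feats)


class SpeechToSpeechModel(nn.Module):
    """Speech encoder -> adaptor -> LLM -> streaming speech decoder (illustrative)."""

    def __init__(self, speech_encoder, adaptor, llm, speech_decoder):
        super().__init__()
        self.speech_encoder = speech_encoder  # pretrained acoustic encoder (assumed frozen)
        self.adaptor = adaptor                # SpeechAdaptor above
        self.llm = llm                        # e.g. a Llama-3.1-8B-Instruct backbone
        self.speech_decoder = speech_decoder  # streams speech from LLM hidden states

    @torch.no_grad()
    def respond(self, speech_input: torch.Tensor):
        feats = self.speech_encoder(speech_input)      # acoustic features
        prompt_embeds = self.adaptor(feats)            # LLM-space embeddings, no transcription step
        hidden, text_tokens = self.llm(prompt_embeds)  # assumed to return hidden states + text
        speech_units = self.speech_decoder(hidden)     # speech produced alongside the text
        return text_tokens, speech_units
```

The key design point the abstract highlights is that the speech decoder consumes the LLM's hidden states directly, so speech can be generated in parallel with the text response rather than waiting for a full text output and a separate TTS pass; the sketch above only illustrates that data flow.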
