
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

May 5, 2025
作者: Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng
cs.AI

Abstract

Real-time, intelligent, and natural speech interaction is an essential part of next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs such as GLM-4-Voice, which was trained on millions of hours of speech data.
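The key idea behind "autoregressive streaming speech synthesis" is that speech tokens are generated incrementally as text tokens arrive, so audio playback can begin before the full text response is complete. The sketch below is a minimal, hypothetical illustration of such a chunked read-write streaming policy; the function names, chunk sizes, and token representations are assumptions for illustration, not LLaMA-Omni2's actual modules.

```python
# Hypothetical sketch of streaming speech synthesis: after every
# `text_chunk` new text tokens from the LLM, a speech decoder emits
# `speech_per_text` speech tokens per text token, which a vocoder
# would then render to audio. Stand-in logic only, not the paper's code.

def stream_speech(text_tokens, text_chunk=3, speech_per_text=2):
    audio_chunks = []
    buffer = []
    for tok in text_tokens:            # LLM emits text tokens one by one
        buffer.append(tok)
        if len(buffer) == text_chunk:  # enough context for the speech decoder
            # the speech decoder conditions on the buffered text tokens
            speech = [f"s({t})" for t in buffer for _ in range(speech_per_text)]
            audio_chunks.append(speech)  # vocoder would convert these to audio
            buffer = []
    if buffer:                         # flush the final partial chunk
        speech = [f"s({t})" for t in buffer for _ in range(speech_per_text)]
        audio_chunks.append(speech)
    return audio_chunks

chunks = stream_speech(["Hi", ",", "how", "are", "you", "?", "!"])
```

With 7 text tokens and a chunk size of 3, the first audio chunk is ready after only 3 text tokens, which is what keeps the interaction latency low.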

