LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
May 5, 2025
Authors: Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng
cs.AI
Abstract
Real-time, intelligent, and natural speech interaction is an essential part
of next-generation human-computer interaction. Recent advancements have
showcased the potential of building intelligent spoken chatbots based on large
language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of
speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable
of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built
upon the Qwen2.5 series models, integrating a speech encoder and an
autoregressive streaming speech decoder. Despite being trained on only 200K
multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong
performance on several spoken question answering and speech instruction
following benchmarks, surpassing previous state-of-the-art SpeechLMs like
GLM-4-Voice, which was trained on millions of hours of speech data.
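The abstract's key latency idea, an autoregressive *streaming* speech decoder, means audio generation starts before the LLM finishes its text response. A common way to realize this is a read-R/write-W interleaving policy: after every R text tokens from the LLM, the speech decoder emits W speech tokens. The sketch below illustrates that scheduling pattern only; the function and token names are hypothetical and are not LLaMA-Omni 2's actual implementation or API.

```python
# Hypothetical read-R / write-W streaming policy sketch (illustrative only,
# not LLaMA-Omni 2's real code): after every R text tokens produced by the
# LLM, the speech decoder emits W speech tokens, so audio playback can
# begin long before the full text response is complete.

def stream_speech(text_tokens, R=3, W=5):
    """Interleave LLM text output with speech-token generation.

    text_tokens: iterable of text tokens from the LLM.
    Returns an ordered list of ("text" | "speech", payload) events,
    in the order they would be streamed to the client.
    """
    events = []
    buffer = []
    for tok in text_tokens:
        buffer.append(tok)
        events.append(("text", tok))
        if len(buffer) == R:
            # The speech decoder consumes the buffered text chunk
            # autoregressively and emits W speech tokens for it.
            for i in range(W):
                events.append(("speech", f"s({''.join(buffer)})#{i}"))
            buffer = []
    # Flush any trailing text shorter than R at end of response.
    if buffer:
        for i in range(W):
            events.append(("speech", f"s({''.join(buffer)})#{i}"))
    return events

if __name__ == "__main__":
    for kind, payload in stream_speech(list("hello"), R=2, W=3):
        print(kind, payload)
```

With R=2 and W=3 on the toy input above, speech tokens for "he" are emitted after only two text tokens, which is the source of the real-time behavior the paper targets; the actual read/write sizes and speech-token vocabulary are design choices of the system, not fixed by this sketch.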