Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
August 29, 2024
Authors: Zhifei Xie, Changqiao Wu
cs.AI
Abstract
Language models have recently made significant progress.
GPT-4o, as a new milestone, has enabled real-time conversations with humans,
demonstrating near-human natural fluency. Such human-computer interaction
necessitates models with the capability to perform reasoning directly with the
audio modality and generate streaming output. However, this remains beyond
the reach of current academic models, as they typically depend on extra TTS
systems for speech synthesis, resulting in undesirable latency. This paper
introduces Mini-Omni, an audio-based, end-to-end conversational model
capable of real-time speech interaction. To achieve this capability, we propose
a text-instructed speech generation method, along with batch-parallel
strategies during inference to further boost performance. Our method also
helps to retain the original model's language capabilities with minimal
degradation, allowing other work to add real-time interaction capabilities
to existing models. We call this training method "Any Model Can Talk". We also
introduce the VoiceAssistant-400K dataset to fine-tune models optimized for
speech output. To the best of our knowledge, Mini-Omni is the first fully end-to-end,
open-source model for real-time speech interaction, offering valuable potential
for future research.
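To make the decoding idea concrete, the following is a minimal, hypothetical sketch of text-instructed parallel generation: a shared backbone emits one text token plus one token per audio codebook at each step, so speech tokens can be streamed while the textual answer is still unfolding. Every name here (ToyOmniDecoder, the GRU backbone, the vocabulary sizes, the seven-codebook layout) is an illustrative assumption, not the authors' implementation, and the paper's batch-parallel inference strategy is not reproduced.

import torch
import torch.nn as nn

# Illustrative sizes only; the real model uses a Transformer backbone and
# an audio codec with multiple codebooks (e.g. SNAC-style) -- assumptions.
TEXT_VOCAB, AUDIO_VOCAB, DIM = 1000, 4096, 256
N_AUDIO_LAYERS = 7  # assumed number of audio codebooks

class ToyOmniDecoder(nn.Module):
    """Toy decoder: one shared backbone, one text head, and one head per
    audio codebook, so text and audio tokens are predicted in parallel."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB, DIM)
        self.backbone = nn.GRU(DIM, DIM, batch_first=True)  # stand-in for a Transformer
        self.text_head = nn.Linear(DIM, TEXT_VOCAB)
        self.audio_heads = nn.ModuleList(
            [nn.Linear(DIM, AUDIO_VOCAB) for _ in range(N_AUDIO_LAYERS)]
        )

    def stream(self, prompt_ids, steps=8):
        """Greedy streaming decode: each step yields the next text token
        together with one token per audio codebook."""
        state, tok = None, prompt_ids  # tok: (batch=1, seq)
        for _ in range(steps):
            hidden, state = self.backbone(self.embed(tok), state)
            last = hidden[:, -1]  # hidden state of the newest position
            text_tok = self.text_head(last).argmax(-1, keepdim=True)
            audio_toks = [h(last).argmax(-1).item() for h in self.audio_heads]
            yield text_tok.item(), audio_toks
            tok = text_tok  # the emitted text token guides the next step

model = ToyOmniDecoder()
prompt = torch.randint(0, TEXT_VOCAB, (1, 4))  # dummy prompt token ids
for text_token, audio_tokens in model.stream(prompt):
    print(text_token, audio_tokens)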