ChatPaper.ai


Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

August 29, 2024
作者: Zhifei Xie, Changqiao Wu
cs.AI

Abstract

Recent advances in language models have achieved significant progress. GPT-4o, as a new milestone, has enabled real-time conversations with humans, demonstrating near-human natural fluency. Such human-computer interaction necessitates models capable of performing reasoning directly in the audio modality and generating streaming output. However, this remains beyond the reach of current academic models, as they typically depend on extra TTS systems for speech synthesis, resulting in undesirable latency. This paper introduces Mini-Omni, an audio-based end-to-end conversational model capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost performance. Our method also helps retain the original model's language capabilities with minimal degradation, enabling other work to establish real-time interaction capabilities. We call this training method "Any Model Can Talk". We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output. To the best of our knowledge, Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction, offering valuable potential for future research.
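To make the abstract's central idea concrete, here is a minimal, purely illustrative sketch (not the authors' code) of text-instructed streaming speech generation: at each autoregressive step the model emits a text token and, in parallel, audio codec tokens guided by that text, so speech can stream out without a separate TTS stage. All names (`toy_lm_step`, `StepOutput`) and the placeholder token values are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class StepOutput:
    text_token: str          # next text token (guides the speech stream)
    audio_tokens: List[int]  # codec tokens emitted in parallel at this step

def toy_lm_step(history: List[StepOutput], step: int) -> StepOutput:
    """Stand-in for one autoregressive step of a speech-text LM."""
    words = ["Hello", ",", " world", "<eos>"]
    text = words[step] if step < len(words) else "<eos>"
    # In the real model, audio tokens are predicted conditioned on the
    # concurrently generated text; here we fabricate placeholder IDs.
    audio = [step * 10 + i for i in range(2)]
    return StepOutput(text, audio)

def stream_decode(max_steps: int = 8) -> Iterator[StepOutput]:
    """Yield text and audio tokens together, step by step (streaming)."""
    history: List[StepOutput] = []
    for step in range(max_steps):
        out = toy_lm_step(history, step)
        history.append(out)
        yield out
        if out.text_token == "<eos>":
            break

if __name__ == "__main__":
    transcript = ""
    audio_stream: List[int] = []
    for out in stream_decode():
        if out.text_token != "<eos>":
            transcript += out.text_token
        audio_stream.extend(out.audio_tokens)
    print(transcript)  # text accumulates while audio tokens stream
```

The point of the sketch is the interleaving: each yielded step carries both modalities, so a consumer can start playing decoded audio before the full response is generated, which is what removes the TTS-induced latency the abstract describes.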

