Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

Abstract

Language models have advanced significantly in recent years. GPT-4o, a new milestone, enables real-time conversation with humans and demonstrates near-human natural fluency. Such human-computer interaction requires models that can reason directly in the audio modality and generate output in a streaming fashion. However, this remains beyond the reach of current academic models, which typically depend on a separate TTS system for speech synthesis and therefore incur undesirable latency. This paper introduces Mini-Omni, an audio-based end-to-end conversational model capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost performance. Our method also helps retain the original model's language capabilities with minimal degradation, enabling other works to establish real-time interaction capabilities. We call this training method "Any Model Can Talk". We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output. To the best of our knowledge, Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction, offering valuable potential for future research.
