Audio Interaction Model

Abstract

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.