Audio Interaction Model

Abstract

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.
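To make the perceive-decide-respond loop concrete, the following is a minimal illustrative sketch, not the Audio-Interaction or SoundFlow implementation. It assumes the stream arrives as a sequence of chunks (here plain strings standing in for audio frames), and every name in it (`LoopState`, `interaction_loop`, `toy_decide`, `toy_respond`) is hypothetical. At each step the loop appends the new chunk to its context, asks a decision function whether to respond, and emits either a response or silence.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class LoopState:
    """Running context accumulated from the stream (hypothetical structure)."""
    history: List[str] = field(default_factory=list)


def interaction_loop(
    chunks: Iterable[str],
    decide: Callable[[LoopState], bool],
    respond: Callable[[LoopState], str],
) -> Iterator[Optional[str]]:
    """Yield one item per incoming chunk: a response string, or None for silence."""
    state = LoopState()
    for chunk in chunks:
        state.history.append(chunk)   # perceive: add the new chunk to the context
        if decide(state):             # decide: should the model speak now?
            yield respond(state)      # respond: produce an output for this step
        else:
            yield None                # stay silent and keep listening


# Toy policy (an assumption for illustration): respond only when the latest
# chunk carries the made-up label "alarm".
def toy_decide(state: LoopState) -> bool:
    return state.history[-1] == "alarm"


def toy_respond(state: LoopState) -> str:
    return f"heard {len(state.history)} chunks, last: {state.history[-1]}"


stream = ["speech", "silence", "alarm", "speech"]
outputs = list(interaction_loop(stream, toy_decide, toy_respond))
print(outputs)
```

In this toy run the loop stays silent for the first two chunks, responds on the third, and returns to silence on the fourth. In a real system the decision would come from the model's reading of the stream semantics rather than from a fixed label match.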