Audio Interaction Model

Abstract

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) operate offline, and existing streaming audio models each handle only a single task, such as streaming ASR or voice chat. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sounds, the environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chat, and that decides when to respond based on the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data construction through training to deployment, using streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a streaming corpus of 2.6M items spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench, a benchmark for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction maintains competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive assistance.
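
To make the perceive-decide-respond loop concrete, the following is a minimal Python sketch under strong simplifying assumptions. It is not the Audio-Interaction model or the SoundFlow framework: the names `StreamState`, `interaction_loop`, and `toy_policy` are hypothetical, audio chunks are stood in for by short text labels, and the hand-written rules in `toy_policy` replace what would be a learned decision in the actual system.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


@dataclass
class StreamState:
    """Running context accumulated from the stream (hypothetical structure)."""
    history: List[str] = field(default_factory=list)


def interaction_loop(
    chunks: Iterable[str],
    decide: Callable[[StreamState], Optional[str]],
) -> Iterator[Tuple[int, str]]:
    """Perceive each chunk, decide whether to respond, and yield (step, response)."""
    state = StreamState()
    for step, chunk in enumerate(chunks):
        state.history.append(chunk)   # perceive: fold the new chunk into context
        response = decide(state)      # decide: None means stay silent
        if response is not None:
            yield step, response      # respond: emit only when warranted


def toy_policy(state: StreamState) -> Optional[str]:
    """Toy stand-in for the model: react only to semantically salient events."""
    latest = state.history[-1]
    if latest == "glass_breaking":
        # Proactive intervention: no instruction was given.
        return "I heard glass breaking. Is everything okay?"
    if latest.endswith("?"):
        # Instruction following: a spoken question arrived in the stream.
        return "Answering: " + latest
    return None


# Text labels stand in for audio chunks (an assumption made for illustration).
events = ["silence", "music", "what song is this?", "silence", "glass_breaking"]
responses = list(interaction_loop(events, toy_policy))
```

In this sketch the loop stays silent on most chunks and responds only at the spoken question and the breaking-glass event, which mirrors the idea of deciding when to respond from the semantics of the stream rather than from a fixed turn boundary.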