OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

Abstract

We introduce OmniInteract, a streaming benchmark that evaluates real-time omnimodal large language models through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, so models must detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 single-question, single-answer (1Q1A) slots across real-time, proactive, and nested scenarios, and 368 single-question, multiple-answer (1QnA) slots for continuous task monitoring and step guidance. Each slot includes a trigger, a response window, and a target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using three measures: Interaction-Aware Quality-Timeliness F1 (IA-QTF1), an Interruption Diagnostic Suite, and a Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction: the best overall IA-QTF1 reaches only 0.368, and the best 1QnA IA-QTF1 only 0.052. A further study of mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly available at https://github.com/Lucky-Lance/OmniInteract.
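
To make the slot description concrete, the following is a minimal Python sketch of how a temporally grounded response slot might be represented. The abstract only states that each slot has a trigger, a response window, and a target answer; the class name `ResponseSlot`, every field name, the closed-interval window check, and the example values are assumptions made for illustration and are not taken from the released dataset or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseSlot:
    """Hypothetical record for one response slot (field names are assumed)."""

    slot_type: str        # "1Q1A" or "1QnA", as named in the abstract
    trigger_time: float   # seconds into the stream at which the trigger occurs
    window_start: float   # earliest time a response would count as timely
    window_end: float     # latest time a response would count as timely
    target_answer: str    # reference answer for this slot

    def contains(self, response_time: float) -> bool:
        # Assumption: the response window is a closed interval.
        return self.window_start <= response_time <= self.window_end


# Slot counts reported in the abstract.
NUM_1Q1A_SLOTS = 1062
NUM_1QNA_SLOTS = 368
TOTAL_SLOTS = NUM_1Q1A_SLOTS + NUM_1QNA_SLOTS  # 1430, matching the stated total

# Illustrative slot with made-up values, not an item from the benchmark.
example_slot = ResponseSlot(
    slot_type="1Q1A",
    trigger_time=12.0,
    window_start=12.0,
    window_end=18.0,
    target_answer="illustrative answer",
)
```

Under these assumptions, `example_slot.contains(13.0)` is true and `example_slot.contains(20.0)` is false. This only illustrates the timing side of a slot; it does not reproduce IA-QTF1 or the other measures, whose definitions are not given in the abstract.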