OmniInteract: A Real-World Streaming Interaction Benchmark for Real-Time Omnimodal Assistants

Abstract

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models that evaluates them through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, so models must detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots covering real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step-by-step guidance. Each slot includes a trigger, a response window, and a target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using the Interaction-Aware Quality-Timeliness F1 score (IA-QTF1), an Interruption Diagnostic Suite, and a Nested Chain Completion Score. Experiments show that current models remain weak at streaming interaction: the best overall IA-QTF1 reaches only 0.368, and the best 1QnA IA-QTF1 only 0.052. A further study of mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly available at https://github.com/Lucky-Lance/OmniInteract.
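To make the slot structure concrete, the following is a minimal sketch of how a response slot with a trigger, a response window, and a target answer might be represented, and how responses could be matched to slots by time. The class name, field names, and the greedy matching rule are all hypothetical illustrations; they are not the released OmniInteract schema, and the sketch does not compute IA-QTF1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseSlot:
    """One temporally grounded response slot (field names are hypothetical)."""
    trigger_time: float   # seconds into the stream when the trigger occurs
    window_start: float   # earliest acceptable response time, in seconds
    window_end: float     # latest acceptable response time, in seconds
    target_answer: str    # reference answer for this slot

    def accepts(self, response_time: float) -> bool:
        """Return True if a response at response_time falls inside the window."""
        return self.window_start <= response_time <= self.window_end


def match_responses(slots, response_times):
    """Greedily pair each slot with the first unused response inside its window.

    This matching rule is an assumption made for illustration only.
    Returns a dict mapping slot index to response index; unmatched slots
    are omitted from the result.
    """
    used = set()
    matches = {}
    for i, slot in enumerate(slots):
        for j, t in enumerate(response_times):
            if j not in used and slot.accepts(t):
                matches[i] = j
                used.add(j)
                break
    return matches
```

Under this sketch, a response emitted outside every window stays unmatched, which is one simple way to think about the invalid outputs and missed slots that the benchmark penalizes.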