OmniInteract: Benchmarking Real Streaming Interaction for Omnimodal Real-Time Assistants

Abstract

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models, evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, so models must detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step-by-step guidance. Each slot comprises a trigger, a response window, and a target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using the Interaction-Aware Quality-Timeliness F1 (IA-QTF1), the Interruption Diagnostic Suite, and the Nested Chain Completion Score. Experiments show that current models remain weak at streaming interaction: the best overall IA-QTF1 is only 0.368, and the best 1QnA IA-QTF1 is only 0.052. A further study of mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly available at https://github.com/Lucky-Lance/OmniInteract.
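
To make the slot structure described above concrete, the following is a minimal sketch of how a temporally grounded response slot might be represented. The class name, field names, and the `is_timely` helper are hypothetical and chosen only for illustration; the abstract does not specify the benchmark's data format, and this check is not the IA-QTF1 metric.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class ResponseSlot:
    """Hypothetical container for one response slot (names are assumptions)."""

    slot_type: Literal["1Q1A", "1QnA"]  # the two slot families named in the abstract
    trigger_time: float                 # seconds into the stream when the trigger occurs
    window_start: float                 # start of the response window, in seconds
    window_end: float                   # end of the response window, in seconds
    target_answer: str                  # reference answer for this slot

    def is_timely(self, response_time: float) -> bool:
        """Return True if a response at `response_time` falls inside the window."""
        return self.window_start <= response_time <= self.window_end


# Slot counts reported in the abstract: the two families sum to the total.
SLOT_COUNTS = {"1Q1A": 1062, "1QnA": 368}
TOTAL_SLOTS = sum(SLOT_COUNTS.values())
```

For example, a slot with a window from 12.0 s to 17.0 s would treat a response at 14.5 s as timely and one at 20.0 s as late; how timing is actually weighted against answer quality is defined by the benchmark's own metrics, not by this sketch.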