OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

Abstract

We introduce OmniInteract, a streaming benchmark that evaluates real-time omnimodal large language models through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, so models must detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step-by-step guidance. Each slot includes a trigger, a response window, and a target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using the Interaction-Aware Quality-Timeliness F1 (IA-QTF1), the Interruption Diagnostic Suite, and the Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction: the best overall IA-QTF1 is only 0.368, and the best 1QnA IA-QTF1 is only 0.052. A further study of mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
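To make the notion of a temporally grounded response slot concrete, the following is a minimal sketch of how a slot and a window check might be represented. The class name `ResponseSlot`, its field names, the helper functions, and the rule that only the first response per slot counts are all assumptions made for illustration; they are not the benchmark's actual schema or scoring code, and the sketch does not compute IA-QTF1.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ResponseSlot:
    """Hypothetical record for one temporally grounded response slot."""
    trigger_time: float   # seconds into the stream when the trigger occurs
    window_start: float   # earliest acceptable response time (seconds)
    window_end: float     # latest acceptable response time (seconds)
    target_answer: str    # reference answer for this slot


def find_slot(slots: List[ResponseSlot], response_time: float) -> Optional[int]:
    """Return the index of the first slot whose window contains response_time."""
    for i, slot in enumerate(slots):
        if slot.window_start <= response_time <= slot.window_end:
            return i
    return None


def split_responses(
    slots: List[ResponseSlot], response_times: List[float]
) -> Tuple[Dict[int, float], List[float]]:
    """Assign each response time to a slot and collect the rest.

    Assumption for this sketch: only the first response inside a window is
    kept; repeated responses and responses outside every window are returned
    separately as unmatched.
    """
    matched: Dict[int, float] = {}
    unmatched: List[float] = []
    for t in response_times:
        idx = find_slot(slots, t)
        if idx is None or idx in matched:
            unmatched.append(t)
        else:
            matched[idx] = t
    return matched, unmatched
```

Under these assumptions, a model that speaks at a time covered by no window, or that answers the same slot twice, contributes to the unmatched list, which loosely mirrors the abstract's distinction between timely responses and invalid outputs.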