OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

Abstract

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models, evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, so models must detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, a response window, and a target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using three measures: Interaction-Aware Quality-Timeliness F1 (IA-QTF1), the Interruption Diagnostic Suite, and the Nested Chain Completion Score. Experiments show that current models remain weak at streaming interaction: the best overall IA-QTF1 reaches only 0.368, and the best 1QnA IA-QTF1 only 0.052. A further study of mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
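
As a rough illustration of the slot structure described in the abstract (a trigger, a response window, and a target answer), a slot could be represented as below. This is a minimal sketch only: the class name `ResponseSlot`, its field names, the use of seconds for timestamps, and the `in_window` helper are all assumptions made for illustration, not the schema or scoring code of the released dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseSlot:
    """Hypothetical slot record; field names are illustrative assumptions."""

    trigger: str          # description of the spoken query or audio-visual event
    window_start: float   # start of the response window, seconds from stream start (assumed unit)
    window_end: float     # end of the response window, seconds from stream start (assumed unit)
    target_answer: str    # reference answer for this slot

    def in_window(self, t: float) -> bool:
        """Return True if a response emitted at time t lies inside the window."""
        return self.window_start <= t <= self.window_end


# Made-up example slot, not taken from the benchmark data.
slot = ResponseSlot(
    trigger="user asks what is on the table",
    window_start=12.0,
    window_end=18.0,
    target_answer="a red mug",
)

on_time = slot.in_window(15.0)   # response inside the window
too_late = slot.in_window(20.0)  # response after the window closes
```

The sketch only checks whether a response time falls inside the window; how the benchmark combines timing with answer quality in IA-QTF1 is not specified in the abstract and is not modeled here.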