OmniInteract: A Benchmark for Real-Time Omnimodal Assistants Using Real-World Streaming Interaction

Abstract

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models, evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, so models must detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, a response window, and a target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1 (IA-QTF1), the Interruption Diagnostic Suite, and the Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction: the best overall IA-QTF1 reaches only 0.368, and the best 1QnA IA-QTF1 only 0.052. A further study of mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly available at https://github.com/Lucky-Lance/OmniInteract.
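To make the slot structure concrete, the following is a minimal sketch of how a response slot (trigger, response window, target answer) and a timing-only score could be represented. It is an illustration under stated assumptions, not the benchmark's implementation: the names `ResponseSlot`, `ModelResponse`, `match_responses`, and `timing_f1` are hypothetical, the greedy matching rule is an assumption, and the score shown is a plain F1 over window hits that ignores answer quality. It is not the definition of IA-QTF1, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ResponseSlot:
    """Hypothetical slot record: a trigger, a response window, a target answer."""
    trigger_time: float   # seconds into the stream at which the trigger occurs
    window_start: float   # earliest acceptable response time (seconds)
    window_end: float     # latest acceptable response time (seconds)
    target_answer: str


@dataclass(frozen=True)
class ModelResponse:
    """Hypothetical model output: when the model spoke and what it said."""
    time: float
    text: str


def match_responses(
    slots: List[ResponseSlot], responses: List[ModelResponse]
) -> Tuple[List[Optional[ModelResponse]], int]:
    """Greedily assign each response to the first unmatched slot whose
    window contains it (an assumed rule, chosen for simplicity).

    Returns a list parallel to `slots` (matched response or None) and the
    number of responses that fell outside every open window.
    """
    matched: List[Optional[ModelResponse]] = [None] * len(slots)
    invalid = 0
    for response in sorted(responses, key=lambda r: r.time):
        for i, slot in enumerate(slots):
            in_window = slot.window_start <= response.time <= slot.window_end
            if matched[i] is None and in_window:
                matched[i] = response
                break
        else:
            # No open slot accepted this response: count it as invalid.
            invalid += 1
    return matched, invalid


def timing_f1(slots: List[ResponseSlot], responses: List[ModelResponse]) -> float:
    """Plain F1 over window hits. Answer correctness is deliberately ignored."""
    matched, invalid = match_responses(slots, responses)
    hits = sum(m is not None for m in matched)
    if hits == 0:
        return 0.0
    precision = hits / (hits + invalid)  # hits + invalid == len(responses)
    recall = hits / len(slots)
    return 2 * precision * recall / (precision + recall)
```

In this sketch, a model that stays silent scores zero on recall, while a model that talks constantly is penalized on precision through invalid outputs. That is the general tension a timing-aware streaming metric has to capture; a full metric would also weigh answer quality against the target answer.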