OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Abstract

The rapid advancement of multi-modal language models (MLLMs) such as GPT-4o has propelled the development of Omni language models, which are designed to process continuous streams of multi-modal data and respond to them proactively. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions and addresses, across six distinct subtasks, two critical challenges that existing video benchmarks leave underexplored: streaming video understanding and proactive reasoning. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.
