OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
March 29, 2025
Authors: Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, Zilong Zheng
cs.AI
Abstract
The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has
propelled the development of Omni language models, designed to process and
proactively respond to continuous streams of multi-modal data. Despite their
potential, evaluating their real-world interactive capabilities in streaming
video contexts remains a formidable challenge. In this work, we introduce
OmniMMI, a comprehensive multi-modal interaction benchmark tailored for
OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and
2,290 questions, addressing two critical yet underexplored challenges in
existing video benchmarks: streaming video understanding and proactive
reasoning, across six distinct subtasks. Moreover, we propose a novel
framework, Multi-modal Multiplexing Modeling (M4), designed to enable an
inference-efficient streaming model that can see and listen while generating.
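The abstract only names the core idea behind M4: interleaving visual and auditory perception with generation in a streaming model that can respond proactively. As a rough illustration of what such an event loop could look like, the toy sketch below uses plain Python with hypothetical names (`StreamingOmniModel`, `perceive`, `should_respond`, `generate_token`); it is an assumption-laden sketch, not the authors' implementation or API.

```python
# Hypothetical sketch (not the authors' M4 code): a toy event loop showing how
# a streaming Omni model might keep perceiving video/audio chunks while
# deciding, proactively, when to start generating a response.

from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class StreamingOmniModel:
    """Toy stand-in for a multiplexed streaming multi-modal model."""
    context: List[Tuple[str, str]] = field(default_factory=list)

    def perceive(self, modality: str, chunk: str) -> None:
        # Fold an incoming video frame or audio segment into the running context.
        self.context.append((modality, chunk))

    def should_respond(self) -> bool:
        # Proactive trigger (toy heuristic): respond when the latest chunk is
        # audio and looks like a question.
        if not self.context:
            return False
        modality, chunk = self.context[-1]
        return modality == "audio" and "?" in chunk

    def generate_token(self) -> str:
        # Placeholder decoding step conditioned on everything perceived so far.
        return f"<token|ctx={len(self.context)}>"


def run_stream(model: StreamingOmniModel,
               stream: Iterable[Tuple[str, str]],
               max_tokens: int = 3) -> List[str]:
    """Interleave perception and generation over a multi-modal stream."""
    outputs: List[str] = []
    for modality, chunk in stream:
        model.perceive(modality, chunk)      # keep seeing / listening
        if model.should_respond():           # decide proactively, mid-stream
            outputs.extend(model.generate_token() for _ in range(max_tokens))
    return outputs


if __name__ == "__main__":
    demo_stream = [
        ("video", "frame_001"),
        ("audio", "background noise"),
        ("video", "frame_002"),
        ("audio", "what is happening now?"),  # user query arrives mid-stream
        ("video", "frame_003"),
    ]
    print(run_stream(StreamingOmniModel(), demo_stream))
```

The point of the sketch is only the control flow: perception never pauses for generation, and the response trigger is evaluated after every chunk rather than waiting for the stream to end.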