OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
March 29, 2025
Authors: Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, Zilong Zheng
cs.AI
Abstract
The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has
propelled the development of Omni language models, designed to process and
proactively respond to continuous streams of multi-modal data. Despite their
potential, evaluating their real-world interactive capabilities in streaming
video contexts remains a formidable challenge. In this work, we introduce
OmniMMI, a comprehensive multi-modal interaction benchmark tailored for
OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and
2,290 questions, addressing two critical yet underexplored challenges in
existing video benchmarks: streaming video understanding and proactive
reasoning, across six distinct subtasks. Moreover, we propose a novel
framework, Multi-modal Multiplexing Modeling (M4), designed to enable an
inference-efficient streaming model that can see and listen while generating.
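The abstract only names the core idea behind M4: interleaving visual and auditory perception with generation in a streaming model that can respond proactively. As a rough illustration of what such an event loop could look like, the toy sketch below uses plain Python with hypothetical names (`StreamingOmniModel`, `perceive`, `should_respond`, `generate_token`); it is an assumption-laden sketch, not the authors' implementation or API.

```python
# Hypothetical sketch (not the authors' M4 code): a toy event loop showing how
# a streaming Omni model might keep perceiving video/audio chunks while
# deciding, proactively, when to start generating a response.

from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class StreamingOmniModel:
    """Toy stand-in for a multiplexed streaming multi-modal model."""
    context: List[Tuple[str, str]] = field(default_factory=list)

    def perceive(self, modality: str, chunk: str) -> None:
        # Fold an incoming video frame or audio segment into the running context.
        self.context.append((modality, chunk))

    def should_respond(self) -> bool:
        # Proactive trigger (toy heuristic): respond when the latest chunk is
        # audio and looks like a question.
        if not self.context:
            return False
        modality, chunk = self.context[-1]
        return modality == "audio" and "?" in chunk

    def generate_token(self) -> str:
        # Placeholder decoding step conditioned on everything perceived so far.
        return f"<token|ctx={len(self.context)}>"


def run_stream(model: StreamingOmniModel,
               stream: Iterable[Tuple[str, str]],
               max_tokens: int = 3) -> List[str]:
    """Interleave perception and generation over a multi-modal stream."""
    outputs: List[str] = []
    for modality, chunk in stream:
        model.perceive(modality, chunk)      # keep seeing / listening
        if model.should_respond():           # decide proactively, mid-stream
            outputs.extend(model.generate_token() for _ in range(max_tokens))
    return outputs


if __name__ == "__main__":
    demo_stream = [
        ("video", "frame_001"),
        ("audio", "background noise"),
        ("video", "frame_002"),
        ("audio", "what is happening now?"),  # user query arrives mid-stream
        ("video", "frame_003"),
    ]
    print(run_stream(StreamingOmniModel(), demo_stream))
```

The point of the sketch is only the control flow: perception never pauses for generation, and the response trigger is evaluated after every chunk rather than waiting for the stream to end.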