
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

March 29, 2025
Authors: Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, Zilong Zheng
cs.AI

Abstract

The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.
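To make the two benchmark challenges concrete, the sketch below illustrates, in toy form, what a streaming interaction loop could look like: incremental (streaming) understanding of incoming frame/audio chunks, plus a proactive decision of *when* to respond without an explicit prompt. This is purely illustrative and is not the paper's M4 framework; all names (`StreamingAgent`, `perceive`, `should_respond`) and the salience heuristic are hypothetical.

```python
# Illustrative sketch only -- NOT the paper's M4 implementation.
# Shows the two behaviors the abstract highlights for streaming OmniLLMs:
# (1) streaming understanding: fold each new chunk into bounded memory
# (2) proactive reasoning: decide when to speak without being asked.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chunk:
    """One time step of the multi-modal stream (placeholder features)."""
    timestamp: float
    visual: List[float]                    # e.g. pooled frame features
    audio: Optional[List[float]] = None    # e.g. audio embedding, may be absent


@dataclass
class StreamingAgent:
    """Toy agent that accumulates stream context and speaks proactively."""
    memory: List[Chunk] = field(default_factory=list)
    trigger_threshold: float = 0.8

    def perceive(self, chunk: Chunk) -> None:
        # Streaming understanding: keep a sliding window instead of
        # re-encoding the whole video on every query.
        self.memory.append(chunk)
        if len(self.memory) > 64:
            self.memory.pop(0)

    def should_respond(self, chunk: Chunk) -> bool:
        # Proactive reasoning placeholder: a real model would score whether
        # the current moment warrants an unprompted response; here a dummy
        # salience score stands in for that decision.
        salience = sum(chunk.visual) / (len(chunk.visual) or 1)
        return salience > self.trigger_threshold

    def respond(self) -> str:
        return (f"[t={self.memory[-1].timestamp:.1f}s] "
                f"proactive response based on {len(self.memory)} chunks")


if __name__ == "__main__":
    agent = StreamingAgent()
    stream = [Chunk(t * 0.5, visual=[0.1 * t]) for t in range(10)]
    for chunk in stream:
        agent.perceive(chunk)
        if agent.should_respond(chunk):
            print(agent.respond())
```

In this toy setup the agent only "speaks" once the dummy salience score crosses a threshold; OmniMMI's proactive subtasks evaluate exactly this kind of when-to-respond behavior, while M4 (per the abstract) additionally targets inference efficiency by multiplexing perception and generation.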

