MM-VID: Advancing Video Understanding with GPT-4V(ision)
October 30, 2023
Authors: Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, Lijuan Wang
cs.AI
Abstract
We present MM-VID, an integrated system that harnesses the capabilities of
GPT-4V, combined with specialized tools in vision, audio, and speech, to
facilitate advanced video understanding. MM-VID is designed to address the
challenges posed by long-form videos and intricate tasks such as reasoning
within hour-long content and grasping storylines spanning multiple episodes.
MM-VID performs video-to-script generation with GPT-4V to transcribe multimodal
elements into a long textual script. The generated script details character
movements, actions, expressions, and dialogues, paving the way for large
language models (LLMs) to achieve video understanding. This enables advanced
capabilities, including audio description, character identification, and
multimodal high-level comprehension. Experimental results demonstrate the
effectiveness of MM-VID in handling distinct video genres with various video
lengths. Additionally, we showcase its potential when applied to interactive
environments, such as video games and graphical user interfaces.
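The pipeline the abstract describes, transcribing per-clip visual and speech content into one long textual script that a downstream LLM can reason over, might be sketched roughly as follows. This is a minimal illustration only; the data shapes and function names are assumptions, not the authors' actual code or API.

```python
# Hypothetical sketch of the MM-VID video-to-script stage.
# Names and structures here are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class ClipScript:
    timestamp: str            # clip time range within the video
    visual_description: str   # e.g. obtained from GPT-4V on sampled frames
    transcript: str           # e.g. obtained from a speech-recognition tool

def generate_script(clips: List[ClipScript]) -> str:
    """Merge per-clip multimodal transcriptions into one long textual script."""
    lines = []
    for clip in clips:
        lines.append(f"[{clip.timestamp}] VISUAL: {clip.visual_description}")
        if clip.transcript:
            lines.append(f"[{clip.timestamp}] SPEECH: {clip.transcript}")
    return "\n".join(lines)

# The resulting script would then be handed to an LLM for the downstream
# tasks the abstract lists (audio description, character identification,
# high-level comprehension).
clips = [
    ClipScript("00:00-00:10", "A character enters a dimly lit room.",
               "Hello? Anyone here?"),
    ClipScript("00:10-00:20", "She picks up a photograph from the desk.", ""),
]
script = generate_script(clips)
print(script)
```

Representing the video as plain text is what lets a text-only LLM handle hour-long content: the script grows linearly with video length but stays within a modality the model can consume directly.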