

MM-VID: Advancing Video Understanding with GPT-4V(ision)

October 30, 2023
Authors: Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, Lijuan Wang
cs.AI

Abstract

We present MM-VID, an integrated system that harnesses the capabilities of GPT-4V, combined with specialized tools in vision, audio, and speech, to facilitate advanced video understanding. MM-VID is designed to address the challenges posed by long-form videos and intricate tasks such as reasoning within hour-long content and grasping storylines spanning multiple episodes. MM-VID uses a video-to-script generation with GPT-4V to transcribe multimodal elements into a long textual script. The generated script details character movements, actions, expressions, and dialogues, paving the way for large language models (LLMs) to achieve video understanding. This enables advanced capabilities, including audio description, character identification, and multimodal high-level comprehension. Experimental results demonstrate the effectiveness of MM-VID in handling distinct video genres with various video lengths. Additionally, we showcase its potential when applied to interactive environments, such as video games and graphic user interfaces.
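The video-to-script idea described above can be illustrated with a minimal conceptual sketch: per-clip visual descriptions and speech transcripts are merged into one long textual script that a text-only LLM can then reason over. All names, data structures, and the stubbed clip contents below are hypothetical placeholders standing in for real GPT-4V and speech-tool outputs; this is not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class ClipDescription:
    start: float   # clip start time in seconds
    end: float     # clip end time in seconds
    visual: str    # stand-in for a GPT-4V description of the clip's frames
    speech: str    # stand-in for an ASR transcript of the clip's audio

def describe_clip(clip: ClipDescription) -> str:
    """Render one clip's multimodal signals as a single script line."""
    return (f"[{clip.start:.1f}s-{clip.end:.1f}s] "
            f"VISUAL: {clip.visual} | DIALOGUE: {clip.speech}")

def video_to_script(clips: list[ClipDescription]) -> str:
    """Concatenate per-clip descriptions into one long textual script
    suitable as input to a text-only large language model."""
    return "\n".join(describe_clip(c) for c in clips)

# Hypothetical clips with placeholder tool outputs.
clips = [
    ClipDescription(0.0, 10.0, "A character enters a dark room.", "Hello? Anyone here?"),
    ClipDescription(10.0, 20.0, "She switches on a lamp and smiles.", "There you are."),
]
print(video_to_script(clips))
```

In the actual system, each clip description would come from prompting GPT-4V on sampled frames, and the dialogue from dedicated speech tools; the sketch only shows how heterogeneous signals collapse into the single script representation the abstract refers to.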