VideoChat: Chat-Centric Video Understanding

Abstract

In this study, we begin an exploration of video understanding by introducing VideoChat, an end-to-end chat-centric video understanding system. It integrates video foundation models and large language models via a learnable neural interface, and it excels at spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we propose a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and causal relationships, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments reveal our system's potential across a broad spectrum of video applications and set a standard for future research. Our code and data are available at https://github.com/OpenGVLab/Ask-Anything
