VideoChat: Chat-Centric Video Understanding

Abstract

In this study, we begin an exploration of video understanding by introducing VideoChat, an end-to-end chat-centric video understanding system. It integrates video foundation models and large language models through a learnable neural interface, and excels at spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we propose a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. The dataset emphasizes spatiotemporal reasoning and causal relationships, making it a valuable resource for training chat-centric video understanding systems. Preliminary qualitative experiments reveal the system's potential across a broad spectrum of video applications and set the standard for future research. Code and data are available at https://github.com/OpenGVLab/Ask-Anything.
