Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
June 8, 2023
Authors: Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
cs.AI
Abstract
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the underexplored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM. The model is capable of understanding and generating human-like conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs, acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise, and use it to train Video-ChatGPT. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyse the strengths and weaknesses of proposed models. Our code, models, instruction-sets and demo are released at https://github.com/mbzuai-oryx/Video-ChatGPT.
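
The abstract describes the architecture only at a high level: a video-adapted visual encoder whose outputs are merged with an LLM. The following is a minimal sketch in PyTorch of one plausible adapter of this kind, assuming per-frame patch features from a frozen image encoder are mean-pooled along the temporal and spatial axes and linearly projected into the LLM's token-embedding space. The class name, pooling scheme, and all dimensions are illustrative assumptions, not the paper's confirmed implementation; see the released code at the repository above for the actual details.

# Minimal sketch (assumptions, not the paper's confirmed implementation):
# per-frame patch features from a frozen image encoder are mean-pooled
# along the temporal and spatial axes, then linearly projected into the
# LLM's token-embedding space. All names and dimensions are illustrative.
import torch
import torch.nn as nn

class VideoToLLMAdapter(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Linear projection from visual feature space to LLM embedding space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, frames, patches, vision_dim)
        temporal = frame_features.mean(dim=2)  # pool over patches -> (B, T, D)
        spatial = frame_features.mean(dim=1)   # pool over frames  -> (B, P, D)
        video_tokens = torch.cat([temporal, spatial], dim=1)  # (B, T+P, D)
        return self.proj(video_tokens)         # (B, T+P, llm_dim)

# Usage: the projected video tokens would be prepended to the instruction's
# token embeddings before the LLM forward pass.
adapter = VideoToLLMAdapter()
feats = torch.randn(1, 100, 256, 1024)  # e.g., 100 sampled frames, 256 patches
print(adapter(feats).shape)             # torch.Size([1, 356, 4096])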