Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
June 8, 2023
Authors: Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
cs.AI
Abstract
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the underexplored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM. The model is capable of understanding and generating human-like conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs, acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise, and use it to train Video-ChatGPT. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyse the strengths and weaknesses of proposed models. Our code, models, instruction-sets and demo are released at https://github.com/mbzuai-oryx/Video-ChatGPT.
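
The abstract describes the architecture only at a high level: a video-adapted visual encoder whose outputs are merged with an LLM. The following is a minimal sketch in PyTorch of one plausible adapter of this kind, assuming per-frame patch features from a frozen image encoder are mean-pooled along the temporal and spatial axes and linearly projected into the LLM's token-embedding space. The class name, pooling scheme, and all dimensions are illustrative assumptions, not the paper's confirmed implementation; see the released code at the repository above for the actual details.

# Minimal sketch (assumptions, not the paper's confirmed implementation):
# per-frame patch features from a frozen image encoder are mean-pooled
# along the temporal and spatial axes, then linearly projected into the
# LLM's token-embedding space. All names and dimensions are illustrative.
import torch
import torch.nn as nn

class VideoToLLMAdapter(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Linear projection from visual feature space to LLM embedding space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, frames, patches, vision_dim)
        temporal = frame_features.mean(dim=2)  # pool over patches -> (B, T, D)
        spatial = frame_features.mean(dim=1)   # pool over frames  -> (B, P, D)
        video_tokens = torch.cat([temporal, spatial], dim=1)  # (B, T+P, D)
        return self.proj(video_tokens)         # (B, T+P, llm_dim)

# Usage: the projected video tokens would be prepended to the instruction's
# token embeddings before the LLM forward pass.
adapter = VideoToLLMAdapter()
feats = torch.randn(1, 100, 256, 1024)  # e.g., 100 sampled frames, 256 patches
print(adapter(feats).shape)             # torch.Size([1, 356, 4096])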