
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

June 8, 2023
Authors: Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
cs.AI

Abstract

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the underexplored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM. The model is capable of understanding and generating human-like conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs, used to train Video-ChatGPT, acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyse the strengths and weaknesses of proposed models. Our code, models, instruction sets and demo are released at https://github.com/mbzuai-oryx/Video-ChatGPT.
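
The abstract does not spell out how the video-adapted visual encoder interfaces with the LLM. Below is a minimal PyTorch sketch of one common design for this kind of adapter: per-frame features from a frozen image encoder are average-pooled along the temporal and spatial axes, and the pooled tokens are linearly projected into the LLM's embedding space. The class name, dimensions, and pooling scheme are illustrative assumptions, not the paper's confirmed implementation; consult the released code at the repository above for the actual details.

```python
# Illustrative sketch of a video-to-LLM feature adapter (hypothetical names/shapes).
import torch
import torch.nn as nn


class VideoFeatureAdapter(nn.Module):
    """Pools per-frame patch features over time and space, then projects
    the pooled tokens into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A single linear layer mapping visual features to LLM embeddings.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, T frames, N patches, vision_dim), e.g. outputs
        # of a frozen CLIP-style image encoder applied frame by frame.
        temporal = frame_feats.mean(dim=1)  # (batch, N, vision_dim): average over frames
        spatial = frame_feats.mean(dim=2)   # (batch, T, vision_dim): average over patches
        tokens = torch.cat([temporal, spatial], dim=1)  # (batch, N + T, vision_dim)
        return self.proj(tokens)            # (batch, N + T, llm_dim)


if __name__ == "__main__":
    adapter = VideoFeatureAdapter()
    feats = torch.randn(2, 8, 64, 1024)     # 2 clips, 8 frames, 64 patches each
    video_tokens = adapter(feats)
    print(video_tokens.shape)               # torch.Size([2, 72, 4096])
```

The resulting video tokens can be prepended to the tokenized instruction so the LLM attends over both; averaging over frames captures spatial layout while averaging over patches captures temporal change, keeping the token count modest compared to feeding every frame's patches directly.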