
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

June 8, 2023
Authors: Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
cs.AI

Abstract

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the underexplored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM. The model is capable of understanding and generating human-like conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs, used to train Video-ChatGPT, acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyse the strengths and weaknesses of proposed models. Our code, models, instruction sets and demo are released at https://github.com/mbzuai-oryx/Video-ChatGPT.
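
The abstract does not spell out how the video-adapted visual encoder interfaces with the LLM. Below is a minimal PyTorch sketch of one common design for this kind of adapter: per-frame features from a frozen image encoder are average-pooled along the temporal and spatial axes, and the pooled tokens are linearly projected into the LLM's embedding space. The class name, dimensions, and pooling scheme are illustrative assumptions, not the paper's confirmed implementation; consult the released code at the repository above for the actual details.

```python
# Illustrative sketch of a video-to-LLM feature adapter (hypothetical names/shapes).
import torch
import torch.nn as nn


class VideoFeatureAdapter(nn.Module):
    """Pools per-frame patch features over time and space, then projects
    the pooled tokens into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A single linear layer mapping visual features to LLM embeddings.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, T frames, N patches, vision_dim), e.g. outputs
        # of a frozen CLIP-style image encoder applied frame by frame.
        temporal = frame_feats.mean(dim=1)  # (batch, N, vision_dim): average over frames
        spatial = frame_feats.mean(dim=2)   # (batch, T, vision_dim): average over patches
        tokens = torch.cat([temporal, spatial], dim=1)  # (batch, N + T, vision_dim)
        return self.proj(tokens)            # (batch, N + T, llm_dim)


if __name__ == "__main__":
    adapter = VideoFeatureAdapter()
    feats = torch.randn(2, 8, 64, 1024)     # 2 clips, 8 frames, 64 patches each
    video_tokens = adapter(feats)
    print(video_tokens.shape)               # torch.Size([2, 72, 4096])
```

The resulting video tokens can be prepended to the tokenized instruction so the LLM attends over both; averaging over frames captures spatial layout while averaging over patches captures temporal change, keeping the token count modest compared to feeding every frame's patches directly.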