Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Abstract

Conversation agents fueled by Large Language Models (LLMs) are providing a new way to interact with visual data. While there have been initial attempts at image-based conversation models, this work addresses the underexplored field of video-based conversation by introducing Video-ChatGPT. It is a multimodal model that merges a video-adapted visual encoder with an LLM. The model is capable of understanding and generating human-like conversations about videos. We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT, acquired via a manual and semi-automated pipeline that is easily scalable and robust to label noise. We also develop a quantitative evaluation framework for video-based dialogue models to objectively analyse the strengths and weaknesses of proposed models. Our code, models, instruction sets and demo are released at https://github.com/mbzuai-oryx/Video-ChatGPT.
