VideoChat: Chat-Centric Video Understanding

Abstract

In this study, we begin an exploration of video understanding by introducing VideoChat, an end-to-end chat-centric video understanding system. It integrates video foundation models and large language models via a learnable neural interface, and it excels at spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we propose a video-centric instruction dataset composed of thousands of videos matched with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and causal relationships, providing a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments reveal our system's potential across a broad spectrum of video applications and set a standard for future research. Access our code and data at https://github.com/OpenGVLab/Ask-Anything
