VideoChat: Chat-Centric Video Understanding

Abstract

In this study, we begin an exploration of video understanding by introducing VideoChat, an end-to-end chat-centric video understanding system. It integrates video foundation models and large language models via a learnable neural interface, and performs well at spatiotemporal reasoning, event localization, and causal relationship inference. To instruction-tune this system, we propose a video-centric instruction dataset composed of thousands of videos paired with detailed descriptions and conversations. The dataset emphasizes spatiotemporal reasoning and causal relationships, making it a valuable asset for training chat-centric video understanding systems. Preliminary qualitative experiments show the system's potential across a broad range of video applications and set a standard for future research. Code and data are available at https://github.com/OpenGVLab/Ask-Anything
