Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Abstract

Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LLMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. The dataset includes around 9,700 hours of long videos sourced from diverse domains, with individual videos ranging from 3 to 60 minutes. Specifically, it contains 3.3M high-quality QA pairs spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations to up to 1 hour and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates semantics that are relevant to the user's question and spatiotemporally informative from a cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.
