Koala: Key frame-conditioned long video-LLM

Abstract

Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) are a promising solution because they have demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short, seconds-long videos, vLLMs are unable to understand minutes-long videos and accurately answer questions about them. To address this limitation, we propose a lightweight, self-supervised approach, Key frame-conditioned long video-LLM (Koala), that introduces learnable spatiotemporal queries to adapt pretrained vLLMs so that they generalize to longer videos. Our approach introduces two new tokenizers that condition on visual tokens computed from sparse video key frames in order to understand short and long video moments. We train our proposed approach on HowTo100M and demonstrate its effectiveness on zero-shot long video understanding benchmarks, where it outperforms state-of-the-art large models by 3-6% in absolute accuracy across all tasks. Surprisingly, we also show empirically that our approach not only helps a pretrained vLLM understand long videos but also improves its accuracy on short-term action recognition.
