Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Abstract

Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in coarse-grained video understanding, but they struggle with fine-grained temporal grounding. In this paper, we introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner. We find that current Video-LLMs are limited in fine-grained video understanding because they lack effective temporal modeling and timestamp representation. To address this, we sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge to represent timestamps. To optimize the training of Grounded-VideoLLM, we employ a multi-stage training scheme, beginning with simple video-captioning tasks and progressively introducing video temporal grounding tasks of increasing complexity. To further enhance Grounded-VideoLLM's temporal reasoning capability, we also curate a grounded VideoQA dataset using an automatic annotation pipeline. Extensive experiments demonstrate that Grounded-VideoLLM not only excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, but also shows great potential as a versatile video assistant for general video understanding.
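
The abstract does not say how timestamps are mapped to discrete temporal tokens. As an illustration only, the sketch below assumes one plausible scheme: quantizing a timestamp's relative position in the video into a fixed number of bins. The function names, the `<t_k>` token format, and the default of 300 bins are hypothetical choices for this example, not details taken from the paper.

```python
def timestamp_to_token(t_seconds: float, duration: float, num_bins: int = 300) -> str:
    """Map a timestamp to a discrete token by quantizing its relative position.

    Assumption: the token vocabulary is <t_0> ... <t_{num_bins-1}> (hypothetical).
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Clamp the relative position to [0, 1] so out-of-range times stay valid.
    ratio = min(max(t_seconds / duration, 0.0), 1.0)
    # The final instant (ratio == 1.0) falls into the last bin.
    index = min(int(ratio * num_bins), num_bins - 1)
    return f"<t_{index}>"


def token_to_timestamp(token: str, duration: float, num_bins: int = 300) -> float:
    """Decode a token back to seconds, using the center of its bin."""
    index = int(token[3:-1])  # strip the "<t_" prefix and ">" suffix
    if not 0 <= index < num_bins:
        raise ValueError("token index out of range")
    return (index + 0.5) / num_bins * duration


if __name__ == "__main__":
    # Encode a moment spanning 12.4 s to 27.9 s in a 60 s video.
    start = timestamp_to_token(12.4, 60.0)
    end = timestamp_to_token(27.9, 60.0)
    print(start, end)
    print(token_to_timestamp(start, 60.0), token_to_timestamp(end, 60.0))
```

Under this assumed scheme, decoding recovers a timestamp to within half a bin width (`duration / num_bins / 2`), so the bin count trades vocabulary size against temporal precision.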
