When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding

Abstract

Understanding videos requires more than answering open-ended questions; it demands the ability to pinpoint when events occur and how entities interact over time. Recent Video LLMs have made remarkable progress in holistic reasoning, but their temporal perception remains coarse: timestamps are encoded only implicitly, frame-level features capture continuity poorly, and language-vision alignment often drifts away from the entities of interest. In this paper, we present Grounded VideoDiT, a Video LLM designed to overcome these limitations through three key innovations. First, a Diffusion Temporal Latent (DTL) encoder enhances boundary sensitivity and maintains temporal consistency. Second, object-grounded representations explicitly bind query entities to localized visual evidence, strengthening alignment. Third, a mixed token scheme with discrete temporal tokens provides explicit timestamp modeling, enabling fine-grained temporal reasoning. Together, these designs give Grounded VideoDiT robust grounding capabilities, as validated by state-of-the-art results on Charades-STA, NExT-GQA, and multiple VideoQA benchmarks.
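To make the idea of discrete temporal tokens concrete, the following is a minimal sketch of one way timestamps could be mapped to tokens and back. It is not the scheme used by Grounded VideoDiT, which the abstract does not specify. The uniform binning, the bin count of 100, the `<T_###>` token format, and the function names `time_to_token`, `token_to_time`, and `format_segment` are all assumptions made for illustration.

```python
def time_to_token(t_sec: float, duration_sec: float, num_bins: int = 100) -> str:
    """Map a timestamp to a discrete token by uniform binning.

    Assumption: the video is split into `num_bins` equal-width bins and
    each bin gets one hypothetical vocabulary token such as "<T_050>".
    """
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    # Clamp so that out-of-range timestamps still fall into a valid bin.
    frac = min(max(t_sec / duration_sec, 0.0), 1.0)
    # The final instant of the video belongs to the last bin.
    idx = min(int(frac * num_bins), num_bins - 1)
    return f"<T_{idx:03d}>"


def token_to_time(token: str, duration_sec: float, num_bins: int = 100) -> float:
    """Recover the bin-centre timestamp in seconds from a temporal token."""
    idx = int(token[3:-1])  # strip the leading "<T_" and the trailing ">"
    return (idx + 0.5) * duration_sec / num_bins


def format_segment(start_sec: float, end_sec: float, duration_sec: float,
                   num_bins: int = 100) -> str:
    """Render a (start, end) segment as a pair of temporal tokens."""
    return (time_to_token(start_sec, duration_sec, num_bins)
            + time_to_token(end_sec, duration_sec, num_bins))
```

Under these assumptions, `time_to_token(30.0, 60.0)` yields `<T_050>`, and decoding a token returns the centre of its bin, so the round-trip error stays within one bin width. Interleaving such tokens with ordinary text tokens is one plausible reading of a "mixed token scheme"; the quantization resolution trades vocabulary size against timestamp precision.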
