

When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding

August 21, 2025
Authors: Pengcheng Fang, Yuxia Chen, Rui Guo
cs.AI

Abstract

Understanding videos requires more than answering open-ended questions; it demands the ability to pinpoint when events occur and how entities interact across time. While recent Video LLMs have achieved remarkable progress in holistic reasoning, they remain coarse in temporal perception: timestamps are encoded only implicitly, frame-level features are weak at capturing continuity, and language-vision alignment often drifts from the entities of interest. In this paper, we present Grounded VideoDiT, a Video LLM designed to overcome these limitations through three key innovations. First, a Diffusion Temporal Latent (DTL) encoder enhances boundary sensitivity and maintains temporal consistency. Second, object-grounded representations explicitly bind query entities to localized visual evidence, strengthening alignment. Third, a mixed token scheme with discrete temporal tokens provides explicit timestamp modeling, enabling fine-grained temporal reasoning. Together, these designs equip Grounded VideoDiT with robust grounding capabilities, as validated by state-of-the-art results on Charades-STA, NExT-GQA, and multiple VideoQA benchmarks.
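To make the third innovation more concrete, the sketch below illustrates one plausible way a mixed token sequence with discrete temporal tokens could be assembled. This is not the authors' implementation: the bin count, the token names (`<t_k>`, `<frame_i>`), and the helper functions `quantize_timestamp` and `build_mixed_sequence` are assumptions introduced here purely to show how timestamps might be quantized into special tokens and interleaved with per-frame visual tokens.

```python
# Hypothetical sketch of a mixed token scheme with discrete temporal tokens.
# All names and the binning choice are illustrative assumptions, not the paper's code.

def quantize_timestamp(t_sec: float, duration_sec: float, num_bins: int = 100) -> int:
    """Map an absolute timestamp to a discrete bin index in [0, num_bins - 1]."""
    frac = min(max(t_sec / max(duration_sec, 1e-6), 0.0), 1.0)
    return min(int(frac * num_bins), num_bins - 1)

def build_mixed_sequence(frame_tokens, frame_times, duration_sec, num_bins: int = 100):
    """Interleave discrete time tokens (e.g. '<t_25>') with per-frame visual tokens.

    frame_tokens: per-frame visual token placeholders (e.g. '<frame_0>')
    frame_times:  timestamps in seconds, aligned with frame_tokens
    """
    tokens = []
    for tok, t in zip(frame_tokens, frame_times):
        bin_idx = quantize_timestamp(t, duration_sec, num_bins)
        tokens.append(f"<t_{bin_idx}>")  # explicit, discrete timestamp token
        tokens.append(tok)               # visual token(s) for this frame
    return tokens

if __name__ == "__main__":
    frames = [f"<frame_{i}>" for i in range(4)]
    times = [0.0, 10.0, 20.0, 30.0]      # seconds
    print(build_mixed_sequence(frames, times, duration_sec=40.0))
    # ['<t_0>', '<frame_0>', '<t_25>', '<frame_1>', '<t_50>', '<frame_2>', '<t_75>', '<frame_3>']
```

Under this kind of scheme, the language model can emit or attend to timestamp tokens directly, which is one way explicit timestamp modeling of the sort described in the abstract could support fine-grained temporal reasoning.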