When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding
August 21, 2025
Authors: Pengcheng Fang, Yuxia Chen, Rui Guo
cs.AI
Abstract
Understanding videos requires more than answering open-ended questions; it
demands the ability to pinpoint when events occur and how entities interact
across time. While recent Video LLMs have achieved remarkable progress in
holistic reasoning, they remain coarse in temporal perception: timestamps are
encoded only implicitly, frame-level features are weak in capturing continuity,
and language-vision alignment often drifts from the entities of interest. In
this paper, we present Grounded VideoDiT, a Video LLM designed to overcome
these limitations by introducing three key innovations. First, a Diffusion
Temporal Latent (DTL) encoder enhances boundary sensitivity and maintains
temporal consistency. Second, object-grounded representations explicitly bind
query entities to localized visual evidence, strengthening alignment. Third, a
mixed token scheme with discrete temporal tokens provides explicit timestamp
modeling, enabling fine-grained temporal reasoning. Together, these designs
equip Grounded VideoDiT with robust grounding capabilities, as validated by
state-of-the-art results on Charades-STA, NExT-GQA, and multiple VideoQA
benchmarks.
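
The abstract states that a mixed token scheme with discrete temporal tokens provides explicit timestamp modeling, but does not specify how the tokens are constructed. The minimal Python sketch below illustrates one plausible reading, assuming uniform quantization of normalized timestamps into a fixed vocabulary of <TIME_k> tokens interleaved with per-frame visual tokens; the function names, token format, and bin count are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the quantization scheme, vocabulary layout, and
# interleaving order below are assumptions, not the authors' implementation.

def quantize_timestamp(t_seconds: float, duration: float, num_bins: int = 100) -> int:
    """Map a continuous timestamp to one of `num_bins` discrete temporal bins."""
    t_norm = min(max(t_seconds / duration, 0.0), 1.0)
    return min(int(t_norm * num_bins), num_bins - 1)

def build_mixed_sequence(frame_tokens, frame_times, duration, num_bins=100):
    """Interleave discrete <TIME_k> tokens with per-frame visual tokens."""
    sequence = []
    for tokens, t in zip(frame_tokens, frame_times):
        k = quantize_timestamp(t, duration, num_bins)
        sequence.append(f"<TIME_{k}>")   # explicit timestamp token for this frame
        sequence.extend(tokens)          # visual tokens for this frame
    return sequence

if __name__ == "__main__":
    # Toy example: 3 frames sampled from a 30-second clip, each with 2 visual tokens.
    frames = [["<v0>", "<v1>"], ["<v2>", "<v3>"], ["<v4>", "<v5>"]]
    times = [0.0, 12.5, 29.9]
    print(build_mixed_sequence(frames, times, duration=30.0))
```

The point of such a scheme, as the abstract frames it, is that discretized timestamps let the model reference specific moments with ordinary tokens rather than inferring time from token order alone.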