TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action
May 2, 2025
Authors: Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, Jenq-Neng Hwang
cs.AI
Abstract
Understanding causal event relationships and achieving fine-grained temporal
grounding in videos remain challenging for vision-language models. Existing
methods either compress video tokens to reduce temporal resolution, or treat
videos as unsegmented streams, which obscures fine-grained event boundaries and
limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event
Masked Prediction and Understanding for Reasoning in Action), a two-stage
training framework that enhances video temporal understanding. TEMPURA first
applies masked event prediction reasoning to reconstruct missing events and
generate step-by-step causal explanations from dense event annotations, drawing
inspiration from effective infilling techniques. TEMPURA then learns to perform
video segmentation and dense captioning to decompose videos into
non-overlapping events with detailed, timestamp-aligned descriptions. We train
TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training
instances and 500K videos with temporally aligned event descriptions and
structured reasoning steps. Experiments on temporal grounding and highlight
detection benchmarks demonstrate that TEMPURA outperforms strong baseline
models, confirming that integrating causal reasoning with fine-grained temporal
segmentation leads to improved video understanding.
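The first training stage described above, masked event prediction, can be illustrated with a minimal sketch. The event timestamps, descriptions, mask token, and prompt wording below are all hypothetical placeholders; the abstract does not specify VER's actual annotation schema or prompt format.

```python
# Conceptual sketch of masked event prediction (TEMPURA stage 1).
# All annotation formats here are assumptions for illustration only.
import random

def make_masked_event_instance(events, seed=0):
    """Mask one event in a timestamped sequence and build a
    reconstruction prompt from the remaining events as context."""
    rng = random.Random(seed)
    idx = rng.randrange(len(events))  # pick one event to hide
    context_lines = []
    for i, (start, end, desc) in enumerate(events):
        if i == idx:
            context_lines.append(f"[{start}s-{end}s] <MASKED_EVENT>")
        else:
            context_lines.append(f"[{start}s-{end}s] {desc}")
    prompt = (
        "The following video events are listed in order. "
        "Infer the masked event and explain your reasoning step by step:\n"
        + "\n".join(context_lines)
    )
    target = events[idx][2]  # ground-truth description of the hidden event
    return prompt, target

# Hypothetical dense event annotations for one video.
events = [
    (0.0, 4.2, "a chef slices onions on a cutting board"),
    (4.2, 9.8, "the onions are added to a hot pan"),
    (9.8, 15.0, "the chef stirs the onions until golden"),
]
prompt, target = make_masked_event_instance(events, seed=1)
```

Training on such instances forces the model to use the surrounding events' causal and temporal context to reconstruct the missing one, which is the intuition behind the infilling-style objective the abstract mentions.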