TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action
May 2, 2025
Authors: Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, Jenq-Neng Hwang
cs.AI
Abstract
Understanding causal event relationships and achieving fine-grained temporal
grounding in videos remain challenging for vision-language models. Existing
methods either compress video tokens to reduce temporal resolution, or treat
videos as unsegmented streams, which obscures fine-grained event boundaries and
limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event
Masked Prediction and Understanding for Reasoning in Action), a two-stage
training framework that enhances video temporal understanding. TEMPURA first
applies masked event prediction reasoning to reconstruct missing events and
generate step-by-step causal explanations from dense event annotations, drawing
inspiration from effective infilling techniques. TEMPURA then learns to perform
video segmentation and dense captioning to decompose videos into
non-overlapping events with detailed, timestamp-aligned descriptions. We train
TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training
instances and 500K videos with temporally aligned event descriptions and
structured reasoning steps. Experiments on temporal grounding and highlight
detection benchmarks demonstrate that TEMPURA outperforms strong baseline
models, confirming that integrating causal reasoning with fine-grained temporal
segmentation leads to improved video understanding.
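The first training stage described above, masked event prediction, can be illustrated with a minimal sketch. The event timestamps, descriptions, mask token, and prompt wording below are all hypothetical placeholders; the abstract does not specify VER's actual annotation schema or prompt format.

```python
# Conceptual sketch of masked event prediction (TEMPURA stage 1).
# All annotation formats here are assumptions for illustration only.
import random

def make_masked_event_instance(events, seed=0):
    """Mask one event in a timestamped sequence and build a
    reconstruction prompt from the remaining events as context."""
    rng = random.Random(seed)
    idx = rng.randrange(len(events))  # pick one event to hide
    context_lines = []
    for i, (start, end, desc) in enumerate(events):
        if i == idx:
            context_lines.append(f"[{start}s-{end}s] <MASKED_EVENT>")
        else:
            context_lines.append(f"[{start}s-{end}s] {desc}")
    prompt = (
        "The following video events are listed in order. "
        "Infer the masked event and explain your reasoning step by step:\n"
        + "\n".join(context_lines)
    )
    target = events[idx][2]  # ground-truth description of the hidden event
    return prompt, target

# Hypothetical dense event annotations for one video.
events = [
    (0.0, 4.2, "a chef slices onions on a cutting board"),
    (4.2, 9.8, "the onions are added to a hot pan"),
    (9.8, 15.0, "the chef stirs the onions until golden"),
]
prompt, target = make_masked_event_instance(events, seed=1)
```

Training on such instances forces the model to use the surrounding events' causal and temporal context to reconstruct the missing one, which is the intuition behind the infilling-style objective the abstract mentions.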