TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

May 2, 2025
Authors: Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, Jenq-Neng Hwang
cs.AI

Abstract

Understanding causal event relationships and achieving fine-grained temporal grounding in videos remain challenging for vision-language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine-grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Prediction and Understanding for Reasoning in Action), a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations, drawing inspiration from effective infilling techniques. TEMPURA then learns to perform video segmentation and dense captioning to decompose videos into non-overlapping events with detailed, timestamp-aligned descriptions. We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps. Experiments on temporal grounding and highlight detection benchmarks demonstrate that TEMPURA outperforms strong baseline models, confirming that integrating causal reasoning with fine-grained temporal segmentation leads to improved video understanding.
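
To make the masked event prediction idea concrete, the following is a minimal sketch of how one training instance might be built from dense, timestamped event annotations. It is not the authors' implementation: the `Event` record, the `<MASKED_EVENT>` placeholder, and both helper functions are hypothetical names chosen for illustration, and the sketch works on text annotations only, whereas the model itself operates on video. It also checks the non-overlap property that the segmentation stage is described as producing.

```python
from dataclasses import dataclass


@dataclass
class Event:
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds
    caption: str   # description aligned to this segment


MASK = "<MASKED_EVENT>"  # hypothetical placeholder, not taken from the paper


def is_non_overlapping(events):
    """Return True if no two events overlap once sorted by start time."""
    ordered = sorted(events, key=lambda e: e.start)
    return all(a.end <= b.start for a, b in zip(ordered, ordered[1:]))


def make_masked_instance(events, mask_index):
    """Hide one event's caption; the hidden caption becomes the target."""
    lines = []
    for i, e in enumerate(events):
        text = MASK if i == mask_index else e.caption
        lines.append(f"[{e.start:.1f}s - {e.end:.1f}s] {text}")
    return "\n".join(lines), events[mask_index].caption


# Made-up annotations, used only to exercise the helpers.
events = [
    Event(0.0, 4.0, "A person cracks eggs into a bowl."),
    Event(4.0, 9.5, "They whisk the eggs."),
    Event(9.5, 15.0, "They pour the mixture into a hot pan."),
]
prompt, target = make_masked_instance(events, mask_index=1)
```

Here `prompt` keeps the surrounding events as context and `target` holds the hidden description, which mirrors the infilling setup the abstract describes: the missing event has to be inferred from what happens before and after it.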

