

TRACE: Temporal Grounding Video LLM via Causal Event Modeling

October 8, 2024
作者: Yongxin Guo, Jingyu Liu, Mingda Li, Xiaoying Tang, Qingbin Liu, Xi Chen
cs.AI

Abstract

Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing. To handle various tasks simultaneously and enable zero-shot prediction, there is a growing trend toward employing video LLMs for VTG tasks. However, current video LLM-based methods rely exclusively on natural language generation and lack the ability to model the clear structure inherent in videos, which restricts their effectiveness on VTG tasks. To address this issue, this paper first formally introduces a causal event modeling framework, which represents videos as sequences of events and predicts the current event using previous events, video inputs, and textual instructions. Each event consists of three components: timestamps, salient scores, and textual captions. We then propose a novel task-interleaved video LLM called TRACE to effectively implement the causal event modeling framework in practice. TRACE treats visual frames, timestamps, salient scores, and text as distinct tasks, employing separate encoders and decoding heads for each. Task tokens are arranged in an interleaved sequence according to the causal event modeling framework's formulation. Extensive experiments on various VTG tasks and datasets demonstrate the superior performance of TRACE compared to state-of-the-art video LLMs. Our model and code are available at https://github.com/gyxxyg/TRACE.
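The event structure and interleaved task-token ordering described in the abstract can be sketched as follows. This is a minimal illustration only: the `Event` dataclass, the token string format, and the `interleave_events` helper are hypothetical names invented here, not TRACE's actual tokenization or API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative sketch of the causal event modeling representation:
# each event carries timestamps, a salient score, and a caption,
# and events are flattened into an interleaved task-token sequence.
# All names and formats here are assumptions, not TRACE's real code.

@dataclass
class Event:
    timestamps: Tuple[float, float]  # (start, end) in seconds
    salient_score: float             # per-event saliency in [0, 1]
    caption: str                     # textual description of the event

def interleave_events(events: List[Event]) -> List[str]:
    """Flatten events into the per-event order the abstract describes:
    timestamps, then salient score, then caption, for each event in turn."""
    tokens = []
    for e in events:
        tokens.append(f"<time:{e.timestamps[0]:.1f}-{e.timestamps[1]:.1f}>")
        tokens.append(f"<score:{e.salient_score:.2f}>")
        tokens.append(f"<caption:{e.caption}>")
    return tokens

seq = interleave_events([
    Event((0.0, 4.5), 0.9, "a person opens the door"),
    Event((4.5, 10.0), 0.4, "the person walks to the table"),
])
```

Under the causal framing, a model conditioned on video inputs and the instruction would predict each event's tokens from all earlier events' tokens, which is why the per-event ordering of the sequence matters.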
