

TRACE: Temporal Grounding Video LLM via Causal Event Modeling

October 8, 2024
作者: Yongxin Guo, Jingyu Liu, Mingda Li, Xiaoying Tang, Qingbin Liu, Xi Chen
cs.AI

Abstract

Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing. To handle various tasks simultaneously and enable zero-shot prediction, there is a growing trend toward employing video LLMs for VTG tasks. However, current video LLM-based methods rely exclusively on natural language generation and lack the ability to model the clear structure inherent in videos, which restricts their effectiveness on VTG tasks. To address this issue, this paper first formally introduces the causal event modeling framework, which represents videos as sequences of events and predicts the current event using previous events, video inputs, and textual instructions. Each event consists of three components: timestamps, salient scores, and textual captions. We then propose a novel task-interleaved video LLM, called TRACE, to effectively implement the causal event modeling framework in practice. TRACE processes visual frames, timestamps, salient scores, and text as distinct tasks, employing dedicated encoders and decoding heads for each. Task tokens are arranged in an interleaved sequence according to the causal event modeling framework's formulation. Extensive experiments on various VTG tasks and datasets demonstrate the superior performance of TRACE compared with state-of-the-art video LLMs. Our model and code are available at https://github.com/gyxxyg/TRACE.
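The causal event modeling framework described above admits a compact autoregressive factorization. The notation below (events e_k, video input F, textual instruction I) is our own shorthand for what the abstract describes, not necessarily the paper's exact formulation:

```latex
% Each event bundles a timestamp pair, a salient score, and a caption.
e_k = (t_k, s_k, c_k)
% The event sequence is predicted autoregressively: each event is
% conditioned on all previous events, the video input F, and the
% textual instruction I.
P(e_1, \dots, e_K \mid F, I) = \prod_{k=1}^{K} P\left(e_k \mid e_{<k}, F, I\right)
```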
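To make the interleaved task-token arrangement concrete, here is a minimal Python sketch of how events (timestamps, salient score, caption) might be serialized into an interleaved sequence. The token spellings and the `Event`/`interleave` names are invented placeholders for illustration, not TRACE's actual vocabulary or API:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Event:
    """One event in the causal event modeling framework:
    a timestamp pair, a salient score, and a textual caption."""
    timestamps: Tuple[float, float]  # (start, end) in seconds
    salient_score: float
    caption: str

def interleave(events: List[Event]) -> List[str]:
    """Arrange task tokens per event in interleaved order:
    timestamp tokens, then the salient-score token, then caption
    text tokens, matching the abstract's description of task tokens
    arranged according to the framework's formulation."""
    tokens: List[str] = []
    for ev in events:
        start, end = ev.timestamps
        tokens += [f"<time:{start:.1f}>", f"<time:{end:.1f}>"]
        tokens.append(f"<score:{ev.salient_score:.2f}>")
        tokens += ev.caption.split()
    return tokens

events = [
    Event((0.0, 4.5), 0.8, "a person opens the door"),
    Event((4.5, 9.0), 0.3, "the person walks away"),
]
print(interleave(events))
```

In the actual model, each task (timestamps, scores, text) would be handled by its own encoder and decoding head rather than plain strings; this sketch only shows the ordering of the interleaved sequence.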


November 16, 2024