TRACE: Temporal Grounding Video LLM via Causal Event Modeling

Abstract

Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing. To handle various tasks simultaneously and enable zero-shot prediction, there is a growing trend of employing video LLMs for VTG tasks. However, current video LLM-based methods rely exclusively on natural language generation and lack the ability to model the clear structure inherent in videos, which restricts their effectiveness on VTG tasks. To address this issue, this paper first formally introduces a causal event modeling framework, which represents a video as a sequence of events and predicts the current event from the previous events, the video inputs, and the textual instructions. Each event consists of three components: timestamps, salient scores, and textual captions. We then propose a novel task-interleaved video LLM, TRACE, to implement the causal event modeling framework in practice. TRACE processes visual frames, timestamps, salient scores, and text as distinct tasks, employing separate encoders and decoding heads for each. Task tokens are arranged in an interleaved sequence according to the formulation of the causal event modeling framework. Extensive experiments on various VTG tasks and datasets demonstrate the superior performance of TRACE compared to state-of-the-art video LLMs. Our model and code are available at https://github.com/gyxxyg/TRACE.
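To make the event representation more concrete, the following is a minimal Python sketch of a video as a sequence of events and of one possible interleaved ordering of their components. It is an illustration under stated assumptions, not the TRACE implementation: the `Event` class, its field names, the `interleave` and `causal_context` helpers, the tag strings, and the time, score, text order within each event are all hypothetical. The abstract states only that each event has timestamps, a salient score, and a caption, and that task tokens are interleaved. Visual frames and the textual instruction, which would also condition each prediction, are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    """One video event (hypothetical field names)."""
    start: float          # start timestamp in seconds
    end: float            # end timestamp in seconds
    salient_score: float  # how salient the event is
    caption: str          # textual description of the event


def interleave(events: List[Event]) -> List[Tuple[str, object]]:
    """Flatten events into one sequence of tagged items.

    Events are ordered by start time, and each event contributes
    three items in the assumed order: time, score, text.
    """
    sequence: List[Tuple[str, object]] = []
    for event in sorted(events, key=lambda e: e.start):
        sequence.append(("time", (event.start, event.end)))
        sequence.append(("score", event.salient_score))
        sequence.append(("text", event.caption))
    return sequence


def causal_context(sequence: List[Tuple[str, object]], k: int) -> List[Tuple[str, object]]:
    """Return the items of all events before event k (0-indexed).

    This is the event history that a prediction of event k
    would be allowed to condition on.
    """
    return sequence[: 3 * k]
```

For example, with two events, `causal_context(interleave(events), 1)` returns only the three items of the earliest event, which mirrors the idea that the current event is predicted from the events that precede it.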
