Video-CoE: Reinforcing Video Event Prediction via Chain of Events

Abstract

Despite advances in applying MLLMs to a range of video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires a model to perform fine-grained temporal modeling of a video and to establish logical relationships between the video and future events, both of which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task and identify the causes of their inaccurate predictions, including a lack of logical reasoning ability for future event prediction and insufficient use of visual information. To address these challenges, we propose the Chain of Events (CoE) paradigm, which constructs temporal event chains that implicitly push the MLLM to focus on the visual content and on the logical connections between videos and future events, and which strengthens the model's reasoning capability through multiple training protocols. Experimental results on public benchmarks show that our method outperforms leading open-source and commercial MLLMs, establishing a new state of the art on the VEP task. Code and models will be released soon.
