
Video-CoE: Reinforcing Video Event Prediction via Chain of Events

March 16, 2026
作者: Qile Su, Jing Tang, Rui Chen, Lei Sun, Xiangxiang Chu
cs.AI

Abstract

Despite advances in applying MLLMs to various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires a model to perform fine-grained temporal modeling of videos and to establish logical relationships between observed video content and future events, both of which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of leading MLLMs on the VEP task, revealing the root causes of their inaccurate predictions: a lack of logical reasoning about future events and insufficient use of visual information. To address these challenges, we propose the Chain of Events (CoE) paradigm, which constructs temporal event chains to implicitly steer the MLLM toward the visual content and the logical connections between videos and future events, incentivizing the model's reasoning capability through multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms leading open-source and commercial MLLMs, establishing a new state of the art on the VEP task. Code and models will be released soon.
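To make the "temporal event chain" idea concrete, the sketch below shows one plausible way such a chain could be represented and serialized into a prompt for an MLLM. The paper does not publish its data format, so all field and class names here (`Event`, `ChainOfEventsSample`, `as_prompt`, etc.) are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of a Chain-of-Events training sample.
# Field names and prompt wording are illustrative assumptions,
# not the format used by the Video-CoE paper.

@dataclass
class Event:
    start_s: float      # event start time within the video, in seconds
    end_s: float        # event end time, in seconds
    description: str    # short natural-language description of the event

@dataclass
class ChainOfEventsSample:
    video_id: str
    observed: list      # temporally ordered events seen in the clip
    future: Event       # the future event the model must predict

    def as_prompt(self) -> str:
        """Serialize the observed event chain into a textual prompt segment."""
        lines = [
            f"[{e.start_s:.1f}-{e.end_s:.1f}s] {e.description}"
            for e in sorted(self.observed, key=lambda e: e.start_s)
        ]
        return "Observed event chain:\n" + "\n".join(lines) + "\nNext event:"

# Example usage with a toy kitchen scene.
sample = ChainOfEventsSample(
    video_id="demo_001",
    observed=[
        Event(0.0, 3.2, "a person fills a kettle with water"),
        Event(3.2, 6.0, "the kettle is placed on the stove"),
    ],
    future=Event(6.0, 9.0, "the person turns on the burner"),
)
print(sample.as_prompt())
```

Structuring the input as an explicit, time-ordered chain is one way to force the model to ground its prediction in the observed visual events rather than in generic language priors, which is the failure mode the abstract identifies.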