

Video-CoE: Reinforcing Video Event Prediction via Chain of Events

March 16, 2026
作者: Qile Su, Jing Tang, Rui Chen, Lei Sun, Xiangxiang Chu
cs.AI

Abstract

Despite advances in the application of MLLMs to various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires a model to perform fine-grained temporal modeling of videos and to establish logical relationships between videos and future events, both of which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including a lack of logical reasoning ability for future-event prediction and insufficient utilization of visual information. To address these challenges, we propose the Chain of Events (CoE) paradigm, which constructs temporal event chains to implicitly guide MLLMs to focus on the visual content and on the logical connections between videos and future events, and incentivizes the model's reasoning capability through multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state of the art on the VEP task. Code and models will be released soon.
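The abstract does not specify how an event chain is represented, but the core idea of serializing observed events in temporal order so the model must ground its forecast in them can be sketched as follows. All names, fields, and the prompt format here are hypothetical illustrations, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Event:
    """One observed event in a video, with its time span in seconds."""
    start: float
    end: float
    description: str

def build_coe_prompt(events: list[Event]) -> str:
    """Serialize an observed event chain into a next-event prediction prompt.

    Making the temporal order explicit encourages the model to link
    its prediction to the events it has actually seen.
    """
    lines = [f"[{e.start:.1f}s-{e.end:.1f}s] {e.description}" for e in events]
    chain = "\n".join(lines)
    return (
        "Observed event chain:\n"
        f"{chain}\n"
        "Based on this chain, predict the next event."
    )

# Example usage with a two-event chain
events = [
    Event(0.0, 3.2, "A person places a pan on the stove."),
    Event(3.2, 7.5, "They pour oil into the pan."),
]
prompt = build_coe_prompt(events)
```

In practice such a chain would be paired with the video frames as multimodal input; the text serialization shown here only illustrates the temporal-ordering constraint.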