
Fostering Video Reasoning via Next-Event Prediction

May 28, 2025
Authors: Haonan Wang, Hongfu Liu, Xiangyan Liu, Chao Du, Kenji Kawaguchi, Ye Wang, Tianyu Pang
cs.AI

Abstract

Next-token prediction serves as the foundational learning task enabling reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, thereby encouraging the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning. To evaluate progress, we introduce FutureBench to assess coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.
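
The past/future split described in the abstract lends itself to a simple data-construction recipe. Below is a minimal Python sketch of how an NEP training example might be assembled; the names (`NEPExample`, `make_nep_example`, `summarizer`) are illustrative assumptions, not the authors' released V1-33K pipeline.

```python
from dataclasses import dataclass

@dataclass
class NEPExample:
    past_frames: list   # frames the MLLM observes as input
    target_summary: str # event summary derived from the future frames

def make_nep_example(frames, split_ratio=0.5, summarizer=None):
    """Split a video's frames at a temporal boundary: the model sees only
    the past frames and is trained to predict a summary of future events."""
    boundary = int(len(frames) * split_ratio)
    past, future = frames[:boundary], frames[boundary:]
    # In V1-33K the supervision target is an event summary extracted from
    # the future segment; `summarizer` stands in for that extraction step.
    target = summarizer(future) if summarizer else "<future-event summary>"
    return NEPExample(past_frames=past, target_summary=target)

# Usage: frames could be file paths or decoded tensors.
example = make_nep_example([f"frame_{i:03d}.jpg" for i in range(16)])
```

Because the target comes from the withheld future segment rather than human annotation, this construction is self-supervised and scales with available video, which is the property the abstract highlights.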

