Fostering Video Reasoning via Next-Event Prediction

Abstract

Next-token prediction serves as the foundational learning task that enables reasoning in large language models (LLMs). But what should the learning task be when the aim is to equip multimodal large language models (MLLMs) with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, which encourages the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning. To evaluate progress, we introduce FutureBench, which assesses coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.
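The past/future split described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the names `NEPExample`, `split_past_future`, and `build_example`, the `past_ratio` parameter, and the use of frame indices in place of decoded frames are all hypothetical, and the future-segment summary is assumed to be supplied as a plain string, since the abstract does not say how it is produced.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NEPExample:
    """One hypothetical NEP training pair (illustrative, not the paper's schema)."""
    past_frames: List[int]  # indices of the frames shown to the model
    target_summary: str     # text the model is trained to predict


def split_past_future(num_frames: int, past_ratio: float = 0.5) -> Tuple[List[int], List[int]]:
    """Split frame indices into a past part and a future part.

    past_ratio is an assumed knob; the abstract does not state a split point.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to form past and future")
    if not 0.0 < past_ratio < 1.0:
        raise ValueError("past_ratio must lie strictly between 0 and 1")
    cut = int(num_frames * past_ratio)
    # Clamp so that both the past and the future side keep at least one frame.
    cut = min(max(cut, 1), num_frames - 1)
    indices = list(range(num_frames))
    return indices[:cut], indices[cut:]


def build_example(num_frames: int, future_summary: str, past_ratio: float = 0.5) -> NEPExample:
    """Pair the past frames (input) with a summary of the future segment (target)."""
    past, _future = split_past_future(num_frames, past_ratio)
    # The future frames are deliberately withheld from the model input;
    # only their summary is used, as the supervision target.
    return NEPExample(past_frames=past, target_summary=future_summary)
```

In this sketch the model never sees the future frames. They contribute only through the target text, which is what makes the signal self-supervised in the sense the abstract describes.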
