Fostering Video Reasoning via Next-Event Prediction

Abstract

Next-token prediction serves as the foundational learning task that enables reasoning in large language models (LLMs). But what should the learning task be when the goal is to equip multimodal large language models (MLLMs) with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that uses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, which encourages the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning. To evaluate progress, we introduce FutureBench, which assesses coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.
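The past/future split described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the names `NEPSample`, `split_past_future`, and `build_nep_sample`, the `past_ratio` parameter, and the `summarize_future` callable are all hypothetical, and the abstract does not say how the split point is chosen or how the future summary is produced.

```python
from dataclasses import dataclass
from typing import Callable, Generic, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")  # any frame representation (index, array, file path, ...)


@dataclass
class NEPSample(Generic[Frame]):
    """One hypothetical training example for next-event prediction."""
    past_frames: List[Frame]  # model input: frames before the split point
    target_summary: str       # training target: summary derived from future frames


def split_past_future(
    frames: Sequence[Frame], past_ratio: float = 0.5
) -> Tuple[List[Frame], List[Frame]]:
    """Split a frame sequence into past and future parts.

    ASSUMPTION: a fixed ratio decides the split point; the source does not
    specify this. Both parts are guaranteed to be non-empty.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to form past and future")
    if not 0.0 < past_ratio < 1.0:
        raise ValueError("past_ratio must lie strictly between 0 and 1")
    cut = int(len(frames) * past_ratio)
    cut = min(max(cut, 1), len(frames) - 1)  # keep at least one frame on each side
    return list(frames[:cut]), list(frames[cut:])


def build_nep_sample(
    frames: Sequence[Frame],
    summarize_future: Callable[[Sequence[Frame]], str],
    past_ratio: float = 0.5,
) -> NEPSample[Frame]:
    """Build a sample whose target is derived only from the future frames.

    ASSUMPTION: `summarize_future` stands in for whatever process turns the
    future segment into an event summary.
    """
    past, future = split_past_future(frames, past_ratio)
    return NEPSample(past_frames=past, target_summary=summarize_future(future))
```

In this sketch the model never sees the future frames directly; they only shape the text target, which is what makes the signal self-supervised in the sense the abstract describes.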
