Imagine, Then Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Abstract

Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video multimodal large language models (MLLMs) usually verbalize intermediate future reasoning in text space. Once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction, align latent states to future-frame embeddings, and then further optimize sampled latent trajectories with LA-DAPO, a latent-aware reinforcement learning (RL) objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on two benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best, Video-CoE, by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.
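
To make the idea of alternating between language tokens and continuous latent visual spans concrete, the toy sketch below shows one way such a decoding loop could be organized. It is not the Future-L1 implementation: the `<latent>` marker, the fixed span length, the `step_fn` interface, and the scripted `toy_step` model are all assumptions introduced here for illustration, since the abstract does not say how latent spans are opened or closed.

```python
from typing import Callable, List, Tuple, Union

Vector = List[float]
Step = Union[str, Vector]  # either a text token or a continuous latent vector

LATENT_START = "<latent>"  # hypothetical marker that opens a latent span
EOS = "<eos>"


def interleaved_decode(
    step_fn: Callable[[List[Step]], Tuple[str, Vector]],
    prompt: List[Step],
    span_len: int = 4,   # assumed fixed span length
    max_steps: int = 32,
) -> List[Step]:
    """Alternate between discrete text tokens and latent spans.

    step_fn stands in for the model: given the sequence so far, it returns
    (next_token, last_hidden_state).
    """
    seq: List[Step] = list(prompt)
    latent_left = 0
    for _ in range(max_steps):
        token, hidden = step_fn(seq)
        if latent_left > 0:
            # Latent mode: feed the hidden state back as the next input
            # instead of committing to a discrete token.
            seq.append(hidden)
            latent_left -= 1
            continue
        seq.append(token)
        if token == EOS:
            break
        if token == LATENT_START:
            latent_left = span_len
    return seq


def toy_step(seq: List[Step]) -> Tuple[str, Vector]:
    # Scripted stand-in for a model: the next token depends only on how
    # many text tokens have been emitted so far.
    script = ["The", "cup", LATENT_START, "falls", EOS]
    n_text = sum(isinstance(s, str) for s in seq)
    token = script[min(n_text, len(script) - 1)]
    hidden = [float(len(seq)), 0.0]  # dummy hidden state
    return token, hidden


out = interleaved_decode(toy_step, prompt=[], span_len=2)
print(out)
```

In this sketch the latent vectors stay in the sequence as continuous inputs and are never mapped to words, which is the property the abstract credits for preserving motion, geometry, and interaction cues.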